Page MenuHomePhabricator

Introduced iterative bytecode execution.
ClosedPublic

Authored by sfuniak on Aug 23 2021, 5:19 AM.

Details

Summary

This is commit 2 of 4 for the multi-root matching in PDL, discussed in https://llvm.discourse.group/t/rfc-multi-root-pdl-patterns-for-kernel-matching/4148 (topic flagged for review).

This commit implements the features needed for the execution of the new operations pdl_interp.get_accepting_ops, pdl_interp.choose_op:

  1. The implementation of the generation and execution of the two ops.
  2. The addition of Stack of bytecode positions within the ByteCodeExecutor. This is needed because in pdl_interp.choose_op, we iterate over the values returned by pdl_interp.get_accepting_ops until we reach finalize. When we reach finalize, we need to return back to the position marked in the stack.
  3. The functionality to extend the lifetime of values that cross the nondeterministic choice. The existing bytecode generator allocates the values to memory positions by representing the liveness of values as a collection of disjoint intervals over the matcher positions. This is akin to register allocation, and substantially reduces the footprint of the bytecode executor. However, because with iterative operation pdl_interp.choose_op, execution "returns" back, so any values whose original liveness cross the nondeterminstic choice must have their lifetime executed until finalize.

Testing: pdl-bytecode.mlir test

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
rriddle added inline comments.Oct 12 2021, 12:23 AM
mlir/lib/Rewrite/ByteCode.cpp
457–460

That is... weird. Can you file an llvm bug and reference it here?

586–601
612
1514–1520

llvm::reverse(*block)?

1514–1520

This feels expensive, I'm also not sure this is correct. If we move the insertion point to the operands, we may end up creating a side effecting operation incorrectly. For example, what happens in the situation where:

%cst = some.constant
%alloc = some.alloc
...
%load = some.load
... <Current Insertion Point>

If the current insertion point is the same above, if we were about to create a some.store %cst in %alloc operation wouldn't this incorrectly create it above the load (thus leading to a miscompile)?

1752–1754

Why not grab the users from all of the values?

1760–1763
1772–1773

Yeah, the opRangeMemory should hold the actual storage. memory should hold the address to the data.

This revision now requires changes to proceed.Oct 12 2021, 12:23 AM
sfuniak added inline comments.Oct 12 2021, 7:42 PM
mlir/lib/Rewrite/ByteCode.cpp
1514–1520

I can see how this change could be problematic. It was mostly introduced, because the current insertion point is at the root of the pattern, and with downward edge traversal, there is no guarantee that all the inputs are defined at this point. However, now that the pdl.rewrite accepts an optional argument forcing the root, this is less of an issue, because the user can manually specify the desired root that's known to work. Furthermore, in our use case, we rely on graph region where the order does not matter anyway. I will revert this change.

General question: what's the cost if I never use multi-root patterns?

General question: what's the cost if I never use multi-root patterns?

If you do not use multi-root patterns, the increase in cost of PDL-to-PDLInterp lowering is negligible (the system will scan through the pattern to determine the candidate roots, which there will be exactly 1). There is no increase in the cost of the bytecode execution, because there will be no pdl_interp.foreach operations in the resulting PDLInterp IR.

sfuniak added inline comments.Oct 14 2021, 10:03 PM
mlir/lib/Rewrite/ByteCode.cpp
314–315

Actually, blockToAddr is always needed to resolve the forward references. It's moved here from
void Generator::generate(ModuleOp module).
What you probably refer to is addrToBlock below. I could guard that one, but as I indicated in my earlier comment, it may be better to determine at runtime whether this map is to be populated, rather than compile time.

mlir/lib/Rewrite/ByteCode.h
186–187

I am a little apprehensive about wrapping this with NDEBUG in a public header. Doing so would break the code if the users of this header file are compiled with the different setting of the NDEBUG flag as the ByteCode.cpp. From llvm/lib/Support/Debug.cpp:

Even though LLVM might be built with NDEBUG, define symbols that the code
built without NDEBUG can depend on via the llvm/Support/Debug.h header.

Furthermore, eventually, I would like to have a debugger-like functionality in ByteCode, so that you can execute the operations one by one. So in the long run, we may want to control at runtime whether or not to populate this map.

rriddle added inline comments.Oct 14 2021, 10:07 PM
mlir/lib/Rewrite/ByteCode.h
186–187

It's not a public header though? This can't be included outside fo lib/Rewrite. I'd also like to avoid any kind of debugging facilities being baked into release builds, that type of functionality should be reserved for compilation setups that want it.

sfuniak added inline comments.Oct 14 2021, 10:15 PM
mlir/lib/Rewrite/ByteCode.h
186–187

I suppose you are right, it's not a public header. We access the bytecode only through FrozenRewritePatternSet. I will do as you suggested.

sfuniak updated this revision to Diff 380907.Oct 20 2021, 5:16 AM

Addressed review feedback and made further improvements:

  1. Fixed the insertion point on pdl_interp.create_operation.
  2. Fixed the implementation of pdl_interp.get_users to match its semantics specified in stacked diff #1.
  3. Eliminated the extra map storing the block for each instruction address. Instead, we store the line number of the instruction directly in the bytecode when debug messages are turned on.
  4. pdl_interp.foreach no longer consumes the range being iterated over. Instead, we store the index into the range and update this index every time we call pdl_interp.continue.
  5. PDLByteCodeMutableState::opRangeMemory now stores an owning reference to each range.
  6. Improved unit tests coverage.

I am sorry, I am not sure why my recent changes to PDLInterps files are showing up here. It's as if the base commit was not set right. I did update the stack diff #1. Any ideas?

sfuniak updated this revision to Diff 380912.Oct 20 2021, 5:37 AM

Attempt to fix a rebase problem.

sfuniak added inline comments.Oct 22 2021, 5:10 AM
mlir/lib/Rewrite/ByteCode.cpp
457–460

I figured this out: IntervalMap has a user-defined destructor but not a user-defined copy constructor. The compiler-generated implicitly-declared copy constructor is wrong. We should just define a move constructor, which is sufficient for most practical purposes. I will submit a diff next week and reference it here.

sfuniak updated this revision to Diff 384962.Thu, Nov 4, 9:50 PM

Updated comment with the IntervalMap fix.

@rriddle: This is the last remaining diff. If you can re-review it whenever you are free, that would be greatly appreciated. Then we can land all 4 and iterate. Thank you!

rriddle added inline comments.Sun, Nov 14, 11:53 PM
mlir/lib/Rewrite/ByteCode.cpp
598

nit: Drop the trivial braces here.

699–704

The debug related functionality feels separable, can you split this out?

941

Could probably use llvm::SaveAndRestore here.

1096
1535–1536

nit: Please move comments to their own line.

1652

We don't really need the reference.

1784

Walking the user count is going to be O(N), what's the trade off here vs. not reserving? Are we banking on a small number of users in practice?

1797

typo: acceessing

1798–1800

Why do we need all of this? Can we augment/add an overload of executeGetOperandsResults for the case we are interested in?

sfuniak marked 20 inline comments as done.Tue, Nov 16, 10:28 PM
sfuniak added inline comments.
mlir/lib/Rewrite/ByteCode.cpp
699–704

will do

941

I find llvm::SaveAndRestore a bit too magical and no less verbose. It certainly helps in cases when we set a variable to an arbitrary value, but in this case, we are increasing and decreasing level, which I think is cleaner as is. We do not have exceptions to guard against either. But you are the code owner, so the decision ultimately rests with you.

1784

Not reserving will require several allocs, one alloc every time we expand the storage. I figured walking the users upfront would be cheaper (though still O(N)). But on the second thought, the number of users will often be small, so maybe SmallVector will do.

1798–1800

We need the comparison, because we need to check if the extracted operand(s) at the specified position match(es) the given value(s). I will create an overload of executeGetOperandsResults as you suggested that does this.

sfuniak updated this revision to Diff 387832.Tue, Nov 16, 10:28 PM
sfuniak marked 2 inline comments as done.

Split out the debug functionality & further review feedback.

Mogball added a comment.EditedWed, Nov 17, 9:54 PM

You mentioned that the cost (when running the bytecode) for pattern sets with no multi-root patterns should be negligible, but wouldn't pattern compilation slow down even for pattern sets without any multi-root patterns? (Because they all have to be scanned).

mlir/lib/Rewrite/ByteCode.cpp
1542

You could just assert(kind == PDLValue::Kind::Operation given that the only other case in this switch is an abort.

1796

If a null value is read, you could just exit early (and set range to empty).

1802

Same here. If there are no values to read (either null pointer or empty range) you can just return.

1815

The implementation of get_users in the bytecode is a little bit complex, stemming from the four cases of whether an index is specified cross whether the operand is a value or value range. Would this be any better with two different ops? One with an index and one without. It'll get rid of the sentinel value too.

There are two places where we do an extra traversal: 1) when verifying the PDL pattern (checking for connectivity), and 2) when forming the predicate tree (detecting the roots). Both are O(E), where E is the total number of operands across all the operations in the PDL pattern -- hardly an expensive operation. Just to be sure, I did a barebones test, where I took 100 copies of test/Conversion/PDLToPDLInterp/pdl-to-pdl-interp-matcher.mlir concatenated together, and ran mlir-opt -split-input-file -convert-pdl-to-pdl-interp > /dev/null on the resulting file. The runtime of my version and the baseline version of mlir-opt were the same.

mlir/lib/Rewrite/ByteCode.cpp
1542

That's true for now, but we are going to soon follow this with another diff where the iteration type is a ValueRange. I thought I'd anticipate the change.

1815

That's a good point, I will create a new op code and split this up into two execute functions.

sfuniak updated this revision to Diff 388154.Thu, Nov 18, 4:17 AM

Split GetUsers into GetUsersAll and GetUsersAt.

There are two places where we do an extra traversal: 1) when verifying the PDL pattern (checking for connectivity), and 2) when forming the predicate tree (detecting the roots). Both are O(E), where E is the total number of operands across all the operations in the PDL pattern -- hardly an expensive operation. Just to be sure, I did a barebones test, where I took 100 copies of test/Conversion/PDLToPDLInterp/pdl-to-pdl-interp-matcher.mlir concatenated together, and ran mlir-opt -split-input-file -convert-pdl-to-pdl-interp > /dev/null on the resulting file. The runtime of my version and the baseline version of mlir-opt were the same.

Thanks for checking!

Just to be clear, this concern doesn't preclude any of the changes you've suggested. I'm just wondering whether explicitly marking a pattern as single-rooted or multi-rooted would be useful (should the performance have been a concern).

mlir/lib/Rewrite/ByteCode.cpp
1542

Gotcha

1796

Unrelated, but this makes me think we need an AttrSizedSegments interface that hides a lot of this work for us.

1824–1838

For this one here, I think you can extract the loop and move it below the if/else of Value vs ValueRange. These functions look much cleaner now, thanks!

Mogball added inline comments.Thu, Nov 18, 9:58 AM
mlir/lib/Rewrite/ByteCode.cpp
914–915
sfuniak updated this revision to Diff 388368.Thu, Nov 18, 8:15 PM

Simplified GetUsers.

Mogball accepted this revision.Fri, Nov 19, 11:47 AM

Looking pretty good to me.

mlir/lib/Rewrite/ByteCode.cpp
1544
rriddle added inline comments.Fri, Nov 19, 11:54 AM
mlir/lib/Rewrite/ByteCode.cpp
138

or the first value in a range

Is this right?

1752–1754

Unresolved?

mlir/lib/Rewrite/ByteCode.h
31

I think we shouldn't define OpRange like this, given that it has a different ownership model than TypeRange/ValueRange.

sfuniak added inline comments.Sun, Nov 21, 2:54 PM
mlir/lib/Rewrite/ByteCode.cpp
1752–1754

Grabbing only the first value in a range is consistent with the definition of pdl_interp.get_defining_op.

I believe, the present definition is correct. You will match the entirety of a value range, so we could take the users of any value. We could also take the intersection of users, but that's more costly and not really needed, because we follow up the users query with operand comparison.

mlir/lib/Rewrite/ByteCode.h
31

Okay... can I just rename it to OwningOpRange then? I still think that using an owning reference below in std::vector<OpRange> opRangeMemory; is the right thing to do (no matter what we choose to call it).

sfuniak updated this revision to Diff 388846.Mon, Nov 22, 3:07 AM

Fixed the semantics of get_users for value range and implemented pdl_interp.get_value.

rriddle accepted this revision.Wed, Nov 24, 2:58 PM

LGTM after resolving the comments in the parent and pulling in the changes to here. I'd really like to get this in tree and iterate from there.

mlir/lib/Rewrite/ByteCode.h
31

Yeah, OwningOpRange is fine. Only had problems with the name, just want to avoid a situation where we would forget that OpRange has different semantics.

This revision is now accepted and ready to land.Wed, Nov 24, 2:58 PM
sfuniak updated this revision to Diff 389898.Thu, Nov 25, 9:12 PM

Implemented pdl_interp.extract.

sfuniak updated this revision to Diff 389904.Thu, Nov 25, 9:46 PM

Minor cleanups.

This revision was automatically updated to reflect the committed changes.