This is an archive of the discontinued LLVM Phabricator instance.

Add support for Lazyloading to the MLIR bytecode
ClosedPublic

Authored by mehdi_amini on Apr 29 2023, 2:40 AM.

Details

Summary

IsolatedRegions are emitted in sections in order for the reader to be
able to skip over them. A new class is exposed to manage the state and
allow the readers to load these IsolatedRegions on-demand.

Diff Detail

Event Timeline

mehdi_amini created this revision.Apr 29 2023, 2:40 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 29 2023, 2:40 AM
mehdi_amini requested review of this revision.Apr 29 2023, 2:40 AM
rriddle accepted this revision.Apr 29 2023, 2:57 AM

Nice! This is pretty much what I had in mind originally. Getting the bytecode format is the major part of this, we can tune the policy/API as we use it.

mlir/include/mlir/Bytecode/BytecodeReader.h
38–39
44–48

Why return a functor instead of driving this through BytecodeReader? The returned functor references the BytecodeReader, so what are the benefits to returning a functor? The only situation I can think of is that it makes things slightly simpler when reading multiple bytecode files (which I run into a lot for my needs), removing the need to remember which one holds each op.

47

Maybe getLazyOpMaterializer? Not blocking at all, but might be nice to keep consistent language for this.

mlir/lib/Bytecode/Reader/BytecodeReader.cpp
1124

Can you document this? Can be a simple callout to the main method (just feels weird having methods undocumented).

1130

Can you add messages to these asserts?

1321

Please document this.

1690–1717
1697–1700
1740

Please add messages to the asserts.

mlir/lib/Bytecode/Writer/BytecodeWriter.cpp
742

Waiting for Jacques patch?

This revision is now accepted and ready to land.Apr 29 2023, 2:57 AM
jpienaar accepted this revision.Apr 29 2023, 5:00 AM

Nice! Thanks

mlir/docs/BytecodeFormat.md
345

Is region_section defined here? (Seems it should be ir_section from below)

mlir/lib/Bytecode/Writer/BytecodeWriter.cpp
742

Would be nice :-)

mlir/test/lib/IR/TestLazyLoading.cpp
52

So if I wanted to extract only a subgraph, one would materilze the op, materialize recursively everything in it while keeping track of referenced symbols and recursively materialize all of them?

60

Nit: error convention is sentence fragments (start lower case and no trailing punctuation)

jpienaar added inline comments.Apr 29 2023, 5:11 AM
mlir/docs/BytecodeFormat.md
345

Did you consider making it so that not all isolated from above are encoded same way? Perhaps stealing a bit from region encoding to indicate encoding. That way one could mark only what is needed and so writer could limit depth and also it's smaller in old encoding. That does complicate the writer though (writer config would need a set or somesuch to track what to encode).

mfrancio accepted this revision.Apr 29 2023, 8:52 AM

Thanks a lot!

mlir/lib/Bytecode/Reader/BytecodeReader.cpp
1654–1658

This line seems unnecessary now?

mehdi_amini added inline comments.Apr 29 2023, 10:45 AM
mlir/docs/BytecodeFormat.md
345

Is region_section defined here? (Seems it should be ir_section from below)

Sections aren't defined, I'm not sure how to express it? I could make it:

region_section {
  region
}

But it is still implicit that it is a section, for example

op {
  name: varint,
  encodingMask: byte,
  ...

There is no section defined here.

As for the encoding, I like the uniformity of the format and the capabilities: it's not clear to me we need to push complexity on the emitter instead of just leaving the flexibility to the consumer. Do you have something in mind?

mlir/include/mlir/Bytecode/BytecodeReader.h
44–48

I was actually trying to have a single map lookup so that the API can do directly "check if this op can be materialized" and "materialize this op" without hitting the map twice! This may not be worth it though: the work to do is heavy enough that a map lookup is negligible.

I'm not super happy with the API right now though, as it's not easy to know which operation need materialization or not.

Another thing is how "unsafe" this all is from the client side, I'd like to think a bit more the API to provide some more guarantees maybe.
For example right now when we parse a file (bytecode or ASM) we verify it, but with lazy-loading we have to disable that (I am not doing it automatically, it has to be done in the config so that it is explicit for the client that verifier isn't performed!).
It is still easy for the client right now to move forward without any verification: what invariants do we want around lazy-loaded input? It is also possible that all use-cases will be so ad-hoc that we can't provide an "easy" API.

mlir/lib/Bytecode/Writer/BytecodeWriter.cpp
742

Yes, I was waiting for Jacques, I'll rebase and fix the TODO!

mlir/test/lib/IR/TestLazyLoading.cpp
52

Yes: you would materialize the module, build a symbol table, then worklist algorithm: add your entry point to the wordlist and while !empty, pop the current function, materialize and collect the symbols, using the symbol table to push the functions to the wordlist.

jpienaar added inline comments.Apr 29 2023, 11:36 AM
mlir/docs/BytecodeFormat.md
345

Good point, we need to tag those a bit differently. Perhaps section<region> and we could have op : Section { ... }.

(Can be follow up, as you mention it's not consistent and we can do a NOP change to make it).

Re encoding: I agree, uniform is better. I was just thinking it's probably rare than one would be lazily loading more than a handful of the regions but there is paying a cost on each. At least size wise, the uniformity does avoid additional cost during serialization/deserialization. I don't have anything quantitative though to back the tradeoff either way.

mehdi_amini added inline comments.Apr 29 2023, 11:47 AM
mlir/docs/BytecodeFormat.md
345

I've been wondering about the performance by the way, right now the cost is (if I don't miss anything):

  • 1 varint for the section size per isolated region
  • 1 entry in a map when lazyloading is enabled

Then in terms of runtime loading I'd be interested to benchmark "always lazy loading" vs "direct loading", it should be very close.
Maybe we need an API on the reader to access the ops to load instead of just getMaterializerOp(Operation *), that'll save traversing the IR.

jpienaar added inline comments.Apr 29 2023, 12:05 PM
mlir/docs/BytecodeFormat.md
345

1 byte for section id and 1 varint (that is if there is no alignment or padding). Not high, but means ~2.4 bytes per function additional. So if you crammed multiple models in ine file, it could add up (especially if you use encodings where everything is a function rather than region ... I've definitely been surprised by what python frontends end up producing). Probably still only at like 100 bytes larger kind of category.

mehdi_amini marked an inline comment as done.
  • Rebase on Jacques' back-deployment support and test it
  • Update the exposed APIs to avoid having to walk the IR to find ops to lazy-load.
rriddle added inline comments.Apr 29 2023, 11:15 PM
mlir/include/mlir/Bytecode/BytecodeReader.h
50–54

When will these be used outside of just testing? I can't think of a situation where this would actually get used.

mehdi_amini added inline comments.Apr 30 2023, 12:01 AM
mlir/include/mlir/Bytecode/BytecodeReader.h
50–54

You could have a list of functions to load and filter based on this?
That said the API isn't good enough yet for this, we can't "drop" things from the worklist without loading them..,.

Update the lazyloading C++ API

jpienaar accepted this revision.May 20 2023, 6:22 AM

Nice - I recall we talked about reachability too and so materializing what is reachable from others materialized and doing that in finalize if needed. But that is probably a higher level function and a more expensive one that could be built on top of this instead.

mlir/include/mlir/Bytecode/BytecodeReader.h
44

This is only needs to be materialized given already materialized right? Ok that sentence is rough, i was mostly trying to think about recursive part and the user shouldn't expect that parser would decent into an unmaterialized op

48

Why couldn't this have same API as finalize and just return whether to materialize or not?

51

Why would a user care about this?

59

make sure to use

62

So upon callback of op to materialize, the callback could then call materialize?

mehdi_amini marked 23 inline comments as done.

Update the callbacks to return a bool to provide control to the client to materialize as we go

mehdi_amini added inline comments.May 20 2023, 3:25 PM
mlir/include/mlir/Bytecode/BytecodeReader.h
44

What about The lazyOps call back is invoked for every ops that contains regions that can be lazy loaded.

I would also change the callback to return a bool to immediately materialize as you suggested below.

51

I don't know, but it's cheap to expose :)

62

It technically can: we shouldn't because it is a recursion, so better use a worklist algorithm, like I did in mlir/test/lib/IR/TestLazyLoading.cpp (around line 70)

This revision was landed with ongoing or failed builds.May 20 2023, 3:25 PM
This revision was automatically updated to reflect the committed changes.

The new test case Bytecode/bytecode-lazy-loading.mlir is failing on s390x due to what seems to be endian issues. (It looks likely that this has just exposed existing endian problems in the bytecode reader.)

Unfortunately the s390x builder was down for a couple of weeks due to maintenance, but now that it's back up, the failing test makes it red. Note that since this patch, three more patches have introduced additional failing test cases on s390x:
https://reviews.llvm.org/D149755 (Bytecode/uselist_orders.mlir)
https://reviews.llvm.org/D151386 (Bytecode/unregistered_dialect.mlir)
https://reviews.llvm.org/D151408 (Bytecode/./MLIRBytecodeTests/Bytecode/MultiModuleWithResource)
all for what appear to be similar (or even the same) endian problems.

See e.g. https://lab.llvm.org/buildbot/#/builders/199/builds/21088 for more details. Looking at the BytecodeReader.cpp file, I can see a number of places that obviously assume little-endian host byte order, e.g. in parseVarInt or parseMultiByteVarInt. Any suggestions on how to handle MLIR bytecode on big-endian hosts?

The new test case Bytecode/bytecode-lazy-loading.mlir is failing on s390x due to what seems to be endian issues. (It looks likely that this has just exposed existing endian problems in the bytecode reader.)

Unfortunately the s390x builder was down for a couple of weeks due to maintenance, but now that it's back up, the failing test makes it red. Note that since this patch, three more patches have introduced additional failing test cases on s390x:
https://reviews.llvm.org/D149755 (Bytecode/uselist_orders.mlir)
https://reviews.llvm.org/D151386 (Bytecode/unregistered_dialect.mlir)
https://reviews.llvm.org/D151408 (Bytecode/./MLIRBytecodeTests/Bytecode/MultiModuleWithResource)
all for what appear to be similar (or even the same) endian problems.

See e.g. https://lab.llvm.org/buildbot/#/builders/199/builds/21088 for more details. Looking at the BytecodeReader.cpp file, I can see a number of places that obviously assume little-endian host byte order, e.g. in parseVarInt or parseMultiByteVarInt. Any suggestions on how to handle MLIR bytecode on big-endian hosts?

MLIR bytecode doesn't support big endian hosts at all, so we'll need to just disable tests in those cases, e.g. https://github.com/llvm/llvm-project/blob/52ca6ad755b0cc2aa603cfb3124bf58c04a47005/mlir/test/Bytecode/resources.mlir#L4

We've discussed it before, but most of the engineers working on the bytecode (myself included) don't have easy big-endian platform access.

I assume Qemu can emulate big-endian systems?

A quick Google search yields https://aircrack-ng.blogspot.com/2018/10/to-be-or-not-to-be-using-qemu-to-run.html ; seems like Debian on MIPS is a possible option.

MLIR bytecode doesn't support big endian hosts at all, so we'll need to just disable tests in those cases, e.g. https://github.com/llvm/llvm-project/blob/52ca6ad755b0cc2aa603cfb3124bf58c04a47005/mlir/test/Bytecode/resources.mlir#L4

We've discussed it before, but most of the engineers working on the bytecode (myself included) don't have easy big-endian platform access.

Ah, right, I had forgotten about that. I've had a closer look now to identify the root causes of these failures, and I can see only two actual problems:

  • Encoding/decoding of multi-byte integers in the bytecode reader/writer. This is trivial to fix, and actually in itself resolves nearly all bytecode related endian problems, including these new failures and existing testcases that were already marked unsupported.
  • Handling of "dense resource element attributes". While "regular" attributes have a tryGetValues accessor that returns a custom iterator that performs endian conversion on access, the "resource" attributes use a tryGetAsArrayRef accessor that returns a standard ArrayRef, which does not do any conversion. It seems it should be possible to use a similar custom iterator approach here as well, which would fix the remaining issues.

I'd be happy to come up with patches for these issues, if you'd be willing to consider accepting such changes. Also, I'd be happy to investigate endian problems that might come up in the future. In the alternative (or in addition), we can also make access to an s390x machine available if you want to look into endian issues yourself.

MLIR bytecode doesn't support big endian hosts at all, so we'll need to just disable tests in those cases, e.g. https://github.com/llvm/llvm-project/blob/52ca6ad755b0cc2aa603cfb3124bf58c04a47005/mlir/test/Bytecode/resources.mlir#L4

We've discussed it before, but most of the engineers working on the bytecode (myself included) don't have easy big-endian platform access.

Ah, right, I had forgotten about that. I've had a closer look now to identify the root causes of these failures, and I can see only two actual problems:

  • Encoding/decoding of multi-byte integers in the bytecode reader/writer. This is trivial to fix, and actually in itself resolves nearly all bytecode related endian problems, including these new failures and existing testcases that were already marked unsupported.
  • Handling of "dense resource element attributes". While "regular" attributes have a tryGetValues accessor that returns a custom iterator that performs endian conversion on access, the "resource" attributes use a tryGetAsArrayRef accessor that returns a standard ArrayRef, which does not do any conversion. It seems it should be possible to use a similar custom iterator approach here as well, which would fix the remaining issues.

I'd be happy to come up with patches for these issues, if you'd be willing to consider accepting such changes. Also, I'd be happy to investigate endian problems that might come up in the future. In the alternative (or in addition), we can also make access to an s390x machine available if you want to look into endian issues yourself.

I would love to review those patches, that would be amazing!!!

MLIR bytecode doesn't support big endian hosts at all, so we'll need to just disable tests in those cases, e.g. https://github.com/llvm/llvm-project/blob/52ca6ad755b0cc2aa603cfb3124bf58c04a47005/mlir/test/Bytecode/resources.mlir#L4

We've discussed it before, but most of the engineers working on the bytecode (myself included) don't have easy big-endian platform access.

Ah, right, I had forgotten about that. I've had a closer look now to identify the root causes of these failures, and I can see only two actual problems:

  • Encoding/decoding of multi-byte integers in the bytecode reader/writer. This is trivial to fix, and actually in itself resolves nearly all bytecode related endian problems, including these new failures and existing testcases that were already marked unsupported.
  • Handling of "dense resource element attributes". While "regular" attributes have a tryGetValues accessor that returns a custom iterator that performs endian conversion on access, the "resource" attributes use a tryGetAsArrayRef accessor that returns a standard ArrayRef, which does not do any conversion. It seems it should be possible to use a similar custom iterator approach here as well, which would fix the remaining issues.

I'd be happy to come up with patches for these issues, if you'd be willing to consider accepting such changes. Also, I'd be happy to investigate endian problems that might come up in the future. In the alternative (or in addition), we can also make access to an s390x machine available if you want to look into endian issues yourself.

I would love to review those patches, that would be amazing!!!

+1, thanks!

I'd be happy to come up with patches for these issues, if you'd be willing to consider accepting such changes. Also, I'd be happy to investigate endian problems that might come up in the future. In the alternative (or in addition), we can also make access to an s390x machine available if you want to look into endian issues yourself.

I would love to review those patches, that would be amazing!!!

I've now posted a patch for the bytecode reader/writer problem here: https://reviews.llvm.org/D153567
The resource decoding problem proved more complicated than I initially thought - I think I'll open an issue to discuss in more detail.

I'd be happy to come up with patches for these issues, if you'd be willing to consider accepting such changes. Also, I'd be happy to investigate endian problems that might come up in the future. In the alternative (or in addition), we can also make access to an s390x machine available if you want to look into endian issues yourself.

I would love to review those patches, that would be amazing!!!

I've now posted a patch for the bytecode reader/writer problem here: https://reviews.llvm.org/D153567
The resource decoding problem proved more complicated than I initially thought - I think I'll open an issue to discuss in more detail.

Issue now open as https://github.com/llvm/llvm-project/issues/63469 - any comments welcome!