This is an archive of the discontinued LLVM Phabricator instance.

[mlir] Move symbol loading from mlir-cpu-runner to ExecutionEngine.
ClosedPublic

Authored by ingomueller-net on Jun 15 2023, 7:46 AM.

Details

Summary

Both the mlir-cpu-runner and the execution engine allow to provide a
list of shared libraries that should be loaded into the process such
that the jitted code can use the symbols from those libraries. The
runner had implemented a protocol that allowed libraries to control
which symbols it wants to provide in that context (with a function
called __mlir_runner_init). In absence of that, the runner would rely on
the loading mechanism of the execution engine, which didn't do anything
particular with the symbols, i.e., only symbols with public visibility
were visible to jitted code.

Libraries used a mix of the two mechanisms: while the runner utils and C
runner utils libs (and potentially others) used public visibility, the
async runtime lib (as the only one in the monorepo) used the loading
protocol. As a consequence, the async runtime library could not be used
through the Python bindings of the execution engine.

This patch moves the loading protocol from the runner to the execution
engine. For the runner, this should not change anything: it lets the
execution engine handle the loading which now implements the same
protocol that the runner had implemented before. However, the Python
binding now get to benefit from the loading protocol as well, so the
async runtime library (and potentially other out-of-tree libraries) can
now be used in that context.

Diff Detail

Event Timeline

Herald added a project: Restricted Project. · View Herald TranscriptJun 15 2023, 7:46 AM
ingomueller-net requested review of this revision.Jun 15 2023, 7:46 AM

I am really unsure about this patch. I don't understand how the underlying ORC works and have moved the code quite mechanically, so please review carefully.

The test suite seems to work, but I get occasional errors for test/mlir-cpu-runner/async* tests, which is suspicious. Running each test in isolation has never failed, though, and rerunning the whole suite always resulted in different failures, which isn't a pattern I would expect as a consequence to this patch, though.

Also, as potential follow up, I could implement the loading protocol for some of the other libraries (runner utils and C runner utils).

mlir/lib/ExecutionEngine/ExecutionEngine.cpp
320

I am not 100% sure why this is necessary. libPath is a StringRef, so it may not be null terminated. In one test of mine, it apparently wasn't, so I saw extra characters. However, both now and before, that StringRef is/was produced by make_absolute. Maybe it was broken before and nobody noticed...

372

These are actually never called in the patch! I don't know where the right place would be...

403

Is this the right place to do this (and all of the above)?

This is what I had in mind! Thanks.

If we could document this somewhere as well?

mlir/lib/ExecutionEngine/ExecutionEngine.cpp
372

In the destructor of the ExecutionEngine?

Implement calling of destroy functions in ~ExecutionEngine and document the loading/unloading protocol.

This is what I had in mind! Thanks.

If we could document this somewhere as well?

Thanks for the guidance! I added some documentation in the source code of the corresponding option and implemented calling the destroy functions. Let me know what you think!

mehdi_amini accepted this revision.Jun 16 2023, 6:46 AM
This revision is now accepted and ready to land.Jun 16 2023, 6:46 AM

@mehdi_amini: There is an effect/issue that I don't yet fully understand, which I am running into while porting the runner utils library to this mechanism. The issue is that I get doubly defined symbols if there is a forward declaration of a function with the emit_c_interface attribute. For example, if I add the following line to one of the unit tests that use the mlir_async_runtime:

func.func private @mlirAsyncRuntimeAddRef(!llvm.ptr, i64) -> () attributes { llvm.emit_c_interface }

then I get the following output and a stackdump:

Failure value returned from cantFail wrapped call
Duplicate definition of symbol 'printMemrefF32'
UNREACHABLE executed at /usr/local/google/home/ingomueller/git/iree-llvm-sandbox/third_party/llvm-project/llvm/include/llvm/Support/Error.h:755!

If the function does not have that attribute, it works fine. For the libraries that currently do *not* use the init mechanism of this patch yet, there is no problem. Many tests contain something like the following for symbols from the runner utils lib:

func.func private @printMemrefF32(memref<*xf32>)
  attributes { llvm.emit_c_interface }

I dug into this a bit and while I don't understand everything involved yet, I have the feeling that the underlying problem is that the emitted C wrapper uses the symbol from the forward declaration, so ORC doesn't want to override it anymore. More precisely, JITDylib::defineImpl(MaterializationUnit &MU) (from llvm/.../Orc/Core.cpp) seems to find the symbol in not being in SymbolState::NeverSearched and thus considers the symbol as not overridable.

As I am writing this, it seems like changing the order in which the symbols are loaded (provided shared libs vs. the IR module that should be run) could avoid this problem but I *think* I have tried this earlier and it didn't work. I'll try this again on Monday and report back, but I am running out of time now.

Should you have a guess on the top of your head (and be at work before I start on Monday), let me know.

Nothing on top of my head!

Quick update: I think the problem is that the loading mechanism from this patch loads the symbols into getMainJITDylib() (which is where the jitted code lives) whereas symbols from "normal" shared libraries are loaded into a new createBareJITDylib(). In the former case, they may collide with symbols from the MLIR input; in the latter case, they get a fresh symbol space. I'll investigate more but my feeling is that the libraries loaded by the loading mechanism should also live in their own JITDylib...

Another update: I think that the problem is actually with FuncToLLVM and how it emits the C interface wrapper. The following function

func.func private @mlirAsyncRuntimeAddRef(!llvm.ptr, i64) -> ()

gets lowered into

llvm.func @mlirAsyncRuntimeAddRef(!llvm.ptr, i64) attributes {sym_visibility = "private"}

which is external, whereas the if the emit_c_interface attribute is added like this:

func.func private @mlirAsyncRuntimeAddRef(!llvm.ptr, i64) -> ()
  attributes { llvm.emit_c_interface }

then it is lowered to the following, which is internal:

llvm.func @mlirAsyncRuntimeAddRef(%arg0: !llvm.ptr, %arg1: i64) attributes {llvm.emit_c_interface, sym_visibility = "private"} {
  llvm.call @_mlir_ciface_mlirAsyncRuntimeAddRef(%arg0, %arg1) : (!llvm.ptr, i64) -> ()
  llvm.return
}
llvm.func @_mlir_ciface_mlirAsyncRuntimeAddRef(!llvm.ptr, i64) attributes {llvm.emit_c_interface, sym_visibility = "private"}

Adding this attribute turns an external function into an internal one. That doesn't sound right, does it?

@ftynse, @Mogball, @Dinistro, @gysit: You also have worked on FuncToLLVM. Any opinions? (I think you can ignore the rest of the patch; just look at this comment.)

Another update: I think that the problem is actually with FuncToLLVM and how it emits the C interface wrapper. The following function

func.func private @mlirAsyncRuntimeAddRef(!llvm.ptr, i64) -> ()

gets lowered into

llvm.func @mlirAsyncRuntimeAddRef(!llvm.ptr, i64) attributes {sym_visibility = "private"}

which is external,

It says sym_visibility = "private", what do you mean by "external" here?

I must be using the wrong terms. In C++ terminology, it turns a declaration (just the signature) into a definition (including a body).

I must be using the wrong terms. In C++ terminology, it turns a declaration (just the signature) into a definition (including a body).

Ah got it!

Seems like the issue is that the runtime library should always expose symbols by prefixing them with _mlir_ciface_ and it's not the case here?

Seems like the issue is that the runtime library should always expose symbols by prefixing them with _mlir_ciface_ and it's not the case here?

It's not the case but I can't really speak to whether it should be or not.

Independently, I find it problematic that adding this attribute turns a "declaration" into a "definition". Don't you agree? Independently of prefixes, this may cause the problem that I am running into here: IR that wanted to refer to a function provided by some other component suddenly overrides that function...

Independently, I find it problematic that adding this attribute turns a "declaration" into a "definition". Don't you agree? Independently of prefixes, this may cause the problem that I am running into here: IR that wanted to refer to a function provided by some other component suddenly overrides that function...

I don't think this is necessarily a problem, there is a convention: at a high-level of the IR you're saying "I want to call an external function, for example: call @printMemref1dI32(%x0) : (memref<?xi32>) -> ().

Now when you have a declaration func.func private @printMemref1dI32(%ptr : memref<?xi32>) attributes { llvm.emit_c_interface }, what the llvm.emit_c_interface attribute is saying is "this function is implemented in runtime library that exposes a C API calling convention and the method will be prefixed with _mlir_ciface_ ".

That does not say anything about the name to use for the wrapper we insert though, but it is a private function so its name should not matter in any way?

Independently, I find it problematic that adding this attribute turns a "declaration" into a "definition". Don't you agree? Independently of prefixes, this may cause the problem that I am running into here: IR that wanted to refer to a function provided by some other component suddenly overrides that function...

...
That does not say anything about the name to use for the wrapper we insert though, but it is a private function so its name should not matter in any way?

Aha, I see. So that "definition" that is inserted is only private, hence it should not clash with any function exported by any other component (but transparently override it, in this case by calling a function with a different name).

I guess then my previous guess went into a better direction: the symbols exported through the loading mechanism should live in a different "symbol space" than the jitted IR, no? In that case, the private functions from the jitted IR module would only be visible there and not collide with those from the libraries. However, currently, the symbols from the libraries are loaded into the same place as those from the jitted module, hence the collisions I currently observe.

If you agree, I'll look into the linking mechanism more and see how I can load the symbols of each library into their own place.

Seems like the issue is that the runtime library should always expose symbols by prefixing them with _mlir_ciface_ and it's not the case here?

The definition that calls the _mlir_ciface_* function should only be generated once, ideally in the same module that holds the original definition. So when the definition is not available anywhere, the entity that defines the original symbol should also provided the _mlir_ciface_ symbol, I guess.

Note that this might be a bit messy if one has no control over the external parts, though.

That does not say anything about the name to use for the wrapper we insert though, but it is a private function so its name should not matter in any way?

Unfortunately, that's not correct. Consider the following the function @ingomueller-net showed before:

llvm.func @mlirAsyncRuntimeAddRef(%arg0: !llvm.ptr, %arg1: i64) attributes {llvm.emit_c_interface, sym_visibility = "private"} {
  llvm.call @_mlir_ciface_mlirAsyncRuntimeAddRef(%arg0, %arg1) : (!llvm.ptr, i64) -> ()
  llvm.return
}

While the symbol has private visibility, the linkage of this function is external (default visibility, thus not printed).

Unfortunately, that's not correct. Consider the following the function @ingomueller-net showed before:

llvm.func @mlirAsyncRuntimeAddRef(%arg0: !llvm.ptr, %arg1: i64) attributes {llvm.emit_c_interface, sym_visibility = "private"} {
  llvm.call @_mlir_ciface_mlirAsyncRuntimeAddRef(%arg0, %arg1) : (!llvm.ptr, i64) -> ()
  llvm.return
}

While the symbol has private visibility, the linkage of this function is external (default visibility, thus not printed).

True! If I convert this to LLVM proper, I get:

define void @mlirAsyncRuntimeAddRef(ptr %0, i64 %1) {
  call void @_mlir_ciface_mlirAsyncRuntimeAddRef(ptr %0, i64 %1)
  ret void
}
declare void @_mlir_ciface_mlirAsyncRuntimeAddRef(ptr, i64)

Both symbols have thus external linkage and default visibility, right?

True! If I convert this to LLVM proper, I get:

define void @mlirAsyncRuntimeAddRef(ptr %0, i64 %1) {
  call void @_mlir_ciface_mlirAsyncRuntimeAddRef(ptr %0, i64 %1)
  ret void
}
declare void @_mlir_ciface_mlirAsyncRuntimeAddRef(ptr, i64)

Both symbols have thus external linkage and default visibility, right?

Yeah, I'm just not sure if this is what you really want. If you have two modules that declared @ mlirAsyncRuntimeAddRef before the lowering, you will have the same definition in both modules. This will cause linking issues eventually.

Yeah, I'm just not sure if this is what you really want. If you have two modules that declared @ mlirAsyncRuntimeAddRef before the lowering, you will have the same definition in both modules. This will cause linking issues eventually.

In the context of this discussion, one of the two comes from a shared library written in C++ that defines a function with this name. Some unit tests want to call this function, so they add a declaration to an external function. However, they also specify the emit_c_interface attribute, so after lowering they suddenly contain a definition with external visibility of that function, which then clashes with that of the shared library.

llvm.func @mlirAsyncRuntimeAddRef(%arg0: !llvm.ptr, %arg1: i64) attributes {llvm.emit_c_interface, sym_visibility = "private"} {
  llvm.call @_mlir_ciface_mlirAsyncRuntimeAddRef(%arg0, %arg1) : (!llvm.ptr, i64) -> ()
  llvm.return
}

While the symbol has private visibility, the linkage of this function is external (default visibility, thus not printed).

While that seems to be true currently, maybe it shouldn't be the case. The IR module is declaring function it knows exists externally, so the lowering should not emit a wrapper function with the same name, default visibility, and external linkage. That's guaranteed to clash, no? If, instead, the lowering emitted a private function that calls the C interface equivalent, then only this IR module would see it as intended, no?

While the symbol has private visibility, the linkage of this function is external (default visibility, thus not printed).

Something does not line up here, unless we're using a different terminology for these attributes, I assumed we're mapping these attributes to LLVM: https://llvm.org/docs/LangRef.html#linkage-types
"private" in LLVM is not an external linkage (there is inconsistency with MLIR visibility here when the MLIR symbol is public but the LLVM visibility would be private...).

While the symbol has private visibility, the linkage of this function is external (default visibility, thus not printed).

While that seems to be true currently, maybe it shouldn't be the case. The IR module is declaring function it knows exists externally, so the lowering should not emit a wrapper function with the same name, default visibility, and external linkage. That's guaranteed to clash, no?

We actually don't know that it exists externally! (it may or may not... the runtime library should only provide an __mlir_ciface_ symbol: we have not reason to expect another one).

If, instead, the lowering emitted a private function that calls the C interface equivalent, then only this IR module would see it as intended, no?

This is what we should be always doing I believe: if we emit a wrapper, regardless of the name, the only way to avoid accidental collision is to have it private.

llvm.func @mlirAsyncRuntimeAddRef(%arg0: !llvm.ptr, %arg1: i64) attributes {llvm.emit_c_interface, sym_visibility = "private"} {

Seems like "sym_visibility" here is the MLIR annotation for symbols (should be moved to properties...), and nothing specific to LLVM. It is weirdly printed here with the attributes instead of prefixed like func.func

While the symbol has private visibility, the linkage of this function is external (default visibility, thus not printed).

Something does not line up here, unless we're using a different terminology for these attributes, I assumed we're mapping these attributes to LLVM: https://llvm.org/docs/LangRef.html#linkage-types
"private" in LLVM is not an external linkage (there is inconsistency with MLIR visibility here when the MLIR symbol is public but the LLVM visibility would be private...).

That's the source of all confusion. llvm.func has a separate linkage attribute which is independent of the symbol visibility. https://mlir.llvm.org/docs/Dialects/LLVM/#llvmfunc-llvmllvmfuncop
I was not involved in the original design, but I assume the following reasons.
Symbol visibility is an MLIR concept that defines where symbols can be seen, this does not necessarily correspond to how these symbols are linked. For example, a func.func declaration is always private, as it's symbol could otherwise collide with the corresponding definition.
Furthermore, there are only 3 kinds of symbol visibility, while there are many more linkage types.