This is an archive of the discontinued LLVM Phabricator instance.

[mlir-cpu-runner] Add export_executable_symbols in CMake.
ClosedPublic

Authored by awarzynski on Mar 27 2023, 12:29 AM.

Details

Summary

LLJIT needs access to symbols (e.g. llvm_orc_registerEHFrameSectionWrapper)
that will be defined in the executable when LLVM is linked statically.

This change is consistent with how other tools within LLVM use LLJIT. It
is required to make sure that mlir-cpu-runner --host-supports-jit
correctly returns true on platforms that do support JITting (in my
case that's AArch64 Linux).

See https://github.com/llvm/llvm-project/issues/61712 for more
context.

Diff Detail

Event Timeline

awarzynski created this revision.Mar 27 2023, 12:29 AM
awarzynski requested review of this revision.Mar 27 2023, 12:29 AM
awarzynski edited the summary of this revision. (Show Details)
This revision is now accepted and ready to land.Mar 28 2023, 5:05 AM
hctim added a subscriber: hctim.Mar 29 2023, 8:52 AM

Hi folks, looks like this broke ASan buildbots because of an ODR. Full reproduction instructions can be found at https://github.com/google/sanitizers/wiki/SanitizerBotReproduceBuild (either using buildbot_fast.sh or buildbot_bootstrap_asan.sh), or alternatively this is quickly reproducible without a multi-stage build using my example cmake config below (which definitely has some stuff you don't need, like libc):

$ cmake \
-DLLVM_ENABLE_ASSERTIONS=ON \
-DCMAKE_C_COMPILER=clang \
-DCMAKE_CXX_COMPILER=clang++ \
-DLLVM_USE_LINKER=lld \
-GNinja \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_C_FLAGS="-fsanitize=address" \
-DCMAKE_CXX_FLAGS="-fsanitize=address" \
-DLLVM_ENABLE_PROJECTS="'clang;lld;clang-tools-extra;mlir'" \
-DLLVM_ENABLE_RUNTIMES="libcxx;libcxxabi" \
-DLLVM_LIBC_ENABLE_LINTING=OFF \
-DLLVM_USE_SANITIZER=Address \
-DLLVM_ENABLE_ASSERTIONS=On \
/path/to/llvm/llvm
$ ninja check-mlir

The bots say (https://lab.llvm.org/buildbot/#/builders/5/builds/32500):

********************
Testing:  0.. 10.. 20.. 30.. 40.. 50.. 60.. 70.. 80.. 90.
FAIL: MLIR :: mlir-cpu-runner/async-group.mlir (70914 of 73663)
******************** TEST 'MLIR :: mlir-cpu-runner/async-group.mlir' FAILED ********************
Script:
--
: 'RUN: at line 6';   export LSAN_OPTIONS=detect_leaks=0
: 'RUN: at line 8';     /b/sanitizer-x86_64-linux-fast/build/llvm_build_asan/bin/mlir-opt /b/sanitizer-x86_64-linux-fast/build/llvm-project/mlir/test/mlir-cpu-runner/async-group.mlir -pass-pipeline="builtin.module(async-to-async-runtime,func.func(async-runtime-ref-counting,async-runtime-ref-counting-opt),convert-async-to-llvm,func.func(convert-arith-to-llvm),convert-func-to-llvm,reconcile-unrealized-casts)"  | /b/sanitizer-x86_64-linux-fast/build/llvm_build_asan/bin/mlir-cpu-runner                                                           -e main -entry-point-result=void -O0                                    -shared-libs=/b/sanitizer-x86_64-linux-fast/build/llvm_build_asan/lib/libmlir_c_runner_utils.so       -shared-libs=/b/sanitizer-x86_64-linux-fast/build/llvm_build_asan/lib/libmlir_runner_utils.so         -shared-libs=/b/sanitizer-x86_64-linux-fast/build/llvm_build_asan/lib/libmlir_async_runtime.so    | /b/sanitizer-x86_64-linux-fast/build/llvm_build_asan/bin/FileCheck /b/sanitizer-x86_64-linux-fast/build/llvm-project/mlir/test/mlir-cpu-runner/async-group.mlir
--
Exit Code: 2
Command Output (stderr):
--
=================================================================
==738620==ERROR: AddressSanitizer: odr-violation (0x7f8fa7f02c40):
  [1] size=4 'llvm::EnableABIBreakingChecks' /b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/lib/Support/ABIBreak.cpp
  [2] size=4 'llvm::EnableABIBreakingChecks' /b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/lib/Support/ABIBreak.cpp
These globals were registered at these points:
  [1]:
    #0 0x55690da70a27 in __asan_register_globals /b/sanitizer-x86_64-linux-fast/build/llvm-project/compiler-rt/lib/asan/asan_globals.cpp:356:3
    #1 0x7f8fa7c4703e in asan.module_ctor ABIBreak.cpp
    #2 0x7f8fc63be47d  (/lib64/ld-linux-x86-64.so.2+0x647d) (BuildId: 61ef896a699bb1c2e4e231642b2e1688b2f1a61e)
  [2]:
    #0 0x55690da70a27 in __asan_register_globals /b/sanitizer-x86_64-linux-fast/build/llvm-project/compiler-rt/lib/asan/asan_globals.cpp:356:3
    #1 0x55690db2c3de in asan.module_ctor ABIBreak.cpp
    #2 0x7f8fc5d10eba in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x29eba) (BuildId: 69389d485a9793dbe873f0ea2c93e02efaa9aa3d)
==738620==HINT: if you don't care about these errors you may set ASAN_OPTIONS=detect_odr_violation=0
SUMMARY: AddressSanitizer: odr-violation: global 'llvm::EnableABIBreakingChecks' at /b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/lib/Support/ABIBreak.cpp
==738620==ABORTING
FileCheck error: '<stdin>' is empty.
FileCheck command line:  /b/sanitizer-x86_64-linux-fast/build/llvm_build_asan/bin/FileCheck /b/sanitizer-x86_64-linux-fast/build/llvm-project/mlir/test/mlir-cpu-runner/async-group.mlir

@hctim , cheers for reverting and apologies for the bother :( I confirm that I can reproduce this - many thanks for simplifying the steps. This is going to be fun to debug 😅 !

No problems at all! In case you're not familiar, ODR is explained a little in our docs (and the other stuff that's linked there). Best bet to track it down is to figure out why llvm/lib/Support/ABIBreak.cpp is linked twice.

I'd grab the mangled version of llvm::EnableABIBreakingChecks by running llvm-nm over mlir-cpu-runner, then finding the link line for mlir-cpu-runner from ninja -vn mlir-cpu-runner, and adding -Wl,--trace-symbol=MANGLED_SYMBOL_NAME to see which .o files have definitions of the symbol, and figure out why it's coming from two different spots :)

awarzynski added a comment.EditedMar 29 2023, 12:57 PM

No problems at all! In case you're not familiar, ODR is explained a little in our docs (and the other stuff that's linked there). Best bet to track it down is to figure out why llvm/lib/Support/ABIBreak.cpp is linked twice.

I'd grab the mangled version of llvm::EnableABIBreakingChecks by running llvm-nm over mlir-cpu-runner, then finding the link line for mlir-cpu-runner from ninja -vn mlir-cpu-runner, and adding -Wl,--trace-symbol=MANGLED_SYMBOL_NAME to see which .o files have definitions of the symbol, and figure out why it's coming from two different spots :)

Thanks for the pointers! I do see that both mlir-cpu-runner and libmlir_async_runtime.so define EnableABIBreakingChecks:

$ nm libmlir_async_runtime.so | grep EnableABIBreakingChecks
000000000038cb80 B _ZN4llvm23EnableABIBreakingChecksE

$ nm mlir-cpu-runner | grep EnableABIBreakingChecks
000000000474364c B _ZN4llvm23EnableABIBreakingChecksE

IIUC, the problem comes from this RUN line:

// RUN: | mlir-cpu-runner  \
// RUN:     -e main -entry-point-result=void -O0 \
// RUN:     -shared-libs=%mlir_c_runner_utils  \
// RUN:     -shared-libs=%mlir_runner_utils  \
// RUN:     -shared-libs=%mlir_async_runtime  \

I'm guessing that this is due to dlopen (which is what -shared-libs translates to). Which makes me think - would ASAN_OPTIONS=detect_odr_violation=0 be the right solution/work-around? Not ideal 🤔 . I could add a GitHub issue to let people know that longer term we should consider some other fix. WDYT?

Btw, I've tried your trick with -Wl,--trace-symbol=MANGLED_SYMBOL_NAME (very neat!), but there seems be only one definition of EnableABIBreakingChecks in libraries linked into mlir-cpu-runner:

ninja mlir-cpu-runner | grep def
lib/libLLVMSupport.a(ABIBreak.cpp.o): definition of _ZN4llvm23EnableABIBreakingChecksE

This would confirm my suspicion that my CMake change doesn't really break anything. It does, however, "unblock" the tests in mlir/test/mlir-cpu-runner (see https://github.com/llvm/llvm-project/issues/61712 for context). I suspect that that ASAN simply didn't run these tests before my change.

What are your thoughts? And thanks for all the input - that's greatly appreciated!

-Andrzej

hctim added a comment.Mar 29 2023, 3:53 PM

Thanks for the pointers! I do see that both mlir-cpu-runner and libmlir_async_runtime.so define EnableABIBreakingChecks:

$ nm libmlir_async_runtime.so | grep EnableABIBreakingChecks

000000000038cb80 B _ZN4llvm23EnableABIBreakingChecksE

$ nm mlir-cpu-runner | grep EnableABIBreakingChecks
000000000474364c B _ZN4llvm23EnableABIBreakingChecksE

IIUC, the problem comes from this RUN line:

// RUN: | mlir-cpu-runner                                                      \
// RUN:     -e main -entry-point-result=void -O0                               \
// RUN:     -shared-libs=%mlir_c_runner_utils  \
// RUN:     -shared-libs=%mlir_runner_utils    \
// RUN:     -shared-libs=%mlir_async_runtime   \

I'm guessing that this is due to dlopen (which is what -shared-libs translates to). Which makes me think - would ASAN_OPTIONS=detect_odr_violation=0 be the right solution/work-around? Not ideal 🤔 . I could add a GitHub issue to let people know that longer term we should consider some other fix. WDYT?

Yeah, dlopen-ing a symbol that's already defined in the main executable would definitely do it. You shouldn't just disable ODR checking, as this is a real bug.

I had a look at our internal codebase, and saw three symbols (including the one you listed in your bug) being allow-listed with the following flags:

-Wl,--export-dynamic-symbol=llvm_orc_registerJITLoaderGDBWrapper
-Wl,--export-dynamic-symbol=llvm_orc_registerEHFrameSectionWrapper
-Wl,--export-dynamic-symbol=llvm_orc_deregisterEHFrameSectionWrapper

Do you think that'd do it for you instead of exporting every linked symbol from the executable?

hctim added a comment.Mar 29 2023, 3:55 PM

Yeah, dlopen-ing a symbol that's already defined in the main executable would definitely do it. You shouldn't just disable ODR checking, as this is a real bug.

I had a look at our internal codebase, and saw three symbols (including the one you listed in your bug) being allow-listed with the following flags:

-Wl,--export-dynamic-symbol=llvm_orc_registerJITLoaderGDBWrapper
-Wl,--export-dynamic-symbol=llvm_orc_registerEHFrameSectionWrapper
-Wl,--export-dynamic-symbol=llvm_orc_deregisterEHFrameSectionWrapper

Do you think that'd do it for you instead of exporting every linked symbol from the executable?

Ah, couple more cxx bits as well, which may be unnecessary here:

-Wl,--undefined=_ZTIi
-Wl,--export-dynamic-symbol=_ZTIi
-Wl,--export-dynamic-symbol=__cxa_begin_catch
-Wl,--export-dynamic-symbol=__cxa_end_catch
-Wl,--export-dynamic-symbol=__gxx_personality_v0
-Wl,--export-dynamic-symbol=__cxa_allocate_exception
-Wl,--export-dynamic-symbol=__cxa_throw

Btw, I've tried your trick with -Wl,--trace-symbol=MANGLED_SYMBOL_NAME (very neat!), but there seems be only one definition of EnableABIBreakingChecks in libraries linked into mlir-cpu-runner:

ninja mlir-cpu-runner | grep def
lib/libLLVMSupport.a(ABIBreak.cpp.o): definition of _ZN4llvm23EnableABIBreakingChecksE

This would confirm my suspicion that my CMake change doesn't really break anything.

No, I was wrong. I assumed that ODR would apply regardless of whether the symbol is exported or not, but that is not correct. I realised this after trying your most recent recommendation:

$ cmake -DCMAKE_EXE_LINKER_FLAGS="-Wl,--export-dynamic-symbol=llvm_orc_registerJITLoaderGDBWrapper -Wl,--export-dynamic-symbol=llvm_orc_registerEHFrameSectionWrapper -Wl,--export-dynamic-symbol=llvm_orc_deregisterEHFrameSectionWrapper" .
$ ninja mlir-cpu-runner
# Only check dynamic/exported symbols
$ llvm-nm -D bin/mlir-cpu-runner | grep EnableABIBreakingChecks
# Check all symbols
$ llvm-nm  bin/mlir-cpu-runner | grep EnableABIBreakingChecks
000000000c223420 B _ZN4llvm23EnableABIBreakingChecksE

So, EnableABIBreakingChecks is still defined inside mlir-cpu-runner, but isn't exported and everything else works (I've verified that).

Do you think that'd do it for you instead of exporting every linked symbol from the executable?

That does work for me and I can't think of any other solution. I will send an updated patch shortly. @lhames , does this make sense from the ORC JIT perspective?

awarzynski reopened this revision.Mar 30 2023, 1:57 AM
This revision is now accepted and ready to land.Mar 30 2023, 1:57 AM

Replace export_executable_symbols with explicit linker export flags

Does removing EnableABIBreakingChecks from the export list actually resolve the ODR violation? As I understand it there would still be two copies of the variable, even if the one in the executable is not exported.

Between the various MLIR libraries and the main executable I think that parts of LLVM are getting linked at least twice, and maybe more times, which I don't think is supported. Fixing the root cause of this may require some refactoring to ensure that only one copy of LLVM is linked. E.g. If the native target initialization APIs and JitRunnerMain entry point could be moved into one of the MLIR dylibs then mlir-cpu-runner needn't link LLVM directly.

If that's not practical then exporting only the necessary symbols seems like a reasonable first step to get this passing.

awarzynski added a comment.EditedMar 30 2023, 8:36 AM

Thanks for chiming in!

Does removing EnableABIBreakingChecks from the export list actually resolve the ODR violation? As I understand it there would still be two copies of the variable, even if the one in the executable is not exported.

That's a good point - I don't know TBH. I does silence the sanitser, but that doesn't mean that the issue is resolved. @hctim ?

Between the various MLIR libraries and the main executable I think that parts of LLVM are getting linked at least twice, and maybe more times, which I don't think is supported. Fixing the root cause of this may require some refactoring to ensure that only one copy of LLVM is linked. E.g. If the native target initialization APIs and JitRunnerMain entry point could be moved into one of the MLIR dylibs then mlir-cpu-runner needn't link LLVM directly.

If that's not practical then exporting only the necessary symbols seems like a reasonable first step to get this passing.

My preference would be to merge this as is and open a separate GitHub issue to communicate that we should work towards a proper fix like you suggested.

EDIT
Created https://github.com/llvm/llvm-project/issues/61856

Add a reference to the GitHub issue

Update the workaround

As pointed out by @mehdi_amini in [1], libSupport is an implementation detail
of mlir_async_runtime we should be refining the CMake definition of that
library rather than mlir-cpu-runner.

I've tested this latest version with sanitsers enabled and didn't get any
errors. And also, mlir-cpu-runner --host-supports-jit returns true as
expected.

[1] https://github.com/llvm/llvm-project/issues/61856

lhames accepted this revision.Mar 31 2023, 11:43 AM

Looks good to me!

Make sure that the extra CMake logic only applies on Linux (I've used GNU linker specific flags)