Page MenuHomePhabricator

[LLDB] Fix address computation for inline function
ClosedPublic

Authored by johannes on Dec 13 2019, 12:43 PM.

Details

Summary

Fixes PR41237 - SIGSEGV on call expression evaluation when debugging clang

When linking multiple compilation units that define the same functions,
the functions is merged but their debug info is not. This ignores debug
info entries for functions in a non-executable sections; those are
functions that were definitely dropped by the linker.

Diff Detail

Event Timeline

johannes created this revision.Dec 13 2019, 12:43 PM
Herald added a project: Restricted Project. · View Herald TranscriptDec 13 2019, 12:43 PM
Herald added a subscriber: aprantl. · View Herald Transcript
johannes updated this revision to Diff 233856.Dec 13 2019, 12:47 PM
johannes edited the summary of this revision. (Show Details)

typo

Harbormaster completed remote builds in B42485: Diff 233856.
aprantl added inline comments.Dec 13 2019, 3:55 PM
lldb/test/Shell/Expr/TestFunctionAddress.lldb
6 ↗(On Diff #233856)

This would be better suited for a python test. (You can copy the example test as a starting point).

aprantl added inline comments.Dec 13 2019, 3:58 PM
lldb/test/Shell/Expr/TestFunctionAddress.lldb
6 ↗(On Diff #233856)

Thinking about it some more, you are testing for a peculiarity in the DWARF emitted by lld. This would actually be best for an assembler or yaml2obj test that guarantees that that peculiarity is actually there. That said, yaml2obj tests make most sense in combination with lldb-test. When you need a live process to run an expression, like here, they make less sense, since they can only run on one particular platform. So I'm not sure what to recommend.

Can you attach a paste of the DWARF that is emitted by lld and the symbol table and annotate which one should be picked. I am having trouble understanding what the right solution for this fix would be. It makes me nervous to require a symbol in the symbol table since symbol tables can be stripped from binaries.

lldb/source/Expression/IRExecutionUnit.cpp
833–835 ↗(On Diff #233856)

Debug info can be in an executable and the symbol can be stripped. Not sure this is safe to require a symbol?

labath added a subscriber: labath.Dec 16 2019, 1:17 AM

This definitely needs a more targeted fix. Requiring a symbol is not good, and it will be particularly problematic on windows, as there we don't have the equivalent of a .symtab (only .dynsym).

Looking at your example, I'm pretty sure that the DW_AT_object_pointer business is a red herring. That's just the normal way of describing member functions -- you first describe the interface of the function withing the containing DW_TAG_structure_type, and then you (outside of the struct) define the precise details of the method (code location, etc.). The second definition references the first one via DW_AT_specification, and lldb should already know how to read these things -- if not then a fix needs to be made there. But I don't think that's necessary, because the choice of linker should not impact this.

What I actually think is happening is that there are *two* definitions of the "four" function (because its an inline function used from two compilation units). This means the linker has to merge them, but unfortunately it cannot also "merge" the debug info (this is a long standing problem with dwarf). So, what the linker does is set the DW_AT_low_pc of the function which got dropped to zero. Lldb does not detect this, and so it ends up trying to call the address zero, and the expression crashes.

The reason that the choice of linker matters here is because they have different behavior regarding which copy gets dropped, and lld seems to drop the one which lldb tries first. This is also why your fix kind of works -- because there is no symbol at address zero (usually).

So, I think the fix for this should be in DWARF code, to make sure these kinds of functions don't even get considered. Somewhere in SymbolFileDWARF::FindFunctions, or maybe even deeper. One way to detect these stripped functions would be to check that the purported address of the function actually lies within the boundaries of one of the sections of the object file.

After that, we can probably come up with a simpler way to test this too...

johannes updated this revision to Diff 234134.Dec 16 2019, 1:32 PM
johannes retitled this revision from [LLDB] Fix address computation for member function linked with lld to [LLDB] Fix address computation for inline function.
johannes edited the summary of this revision. (Show Details)

different approach (WIP): ignore functions with DW_AT_low_pc = 0

Thanks for the very useful feedback! Now I can finally see why this is happening.

Below is the relevant excerpt of the input DWARF:
one of the DW_AT_low_pc attributes is NULL, which does not happen when I link with GNU ld.

clang lldb/test/Shell/Expr/Inputs/function-address-{main,lib}.cpp -g -fuse-ld=lld -o - |
llvm-dwarfdump -debug-info /dev/stdin

0x00000203:   DW_TAG_subprogram
                DW_AT_low_pc	(0x0000000000201940)
                DW_AT_high_pc	(0x000000000020194f)
                DW_AT_frame_base	(DW_OP_reg6 RBP)
                DW_AT_object_pointer	(0x0000021a)
                DW_AT_specification	(0x00000069 "_ZNK14llvm_namespace10TinyVector4fourEv")

0x0000021a:     DW_TAG_formal_parameter
                  DW_AT_location	(DW_OP_fbreg -8)
                  DW_AT_name	("this")
                  DW_AT_type	(0x0000027a "const TinyVector*")
                  DW_AT_artificial	(true)

0x00000226:     NULL

0x0000053a:   DW_TAG_subprogram
                DW_AT_low_pc	(0x0000000000000000)
                DW_AT_high_pc	(0x000000000000000f)
                DW_AT_frame_base	(DW_OP_reg6 RBP)
                DW_AT_object_pointer	(0x00000551)
                DW_AT_specification	(0x00000309 "_ZNK14llvm_namespace10TinyVector4fourEv")

0x00000551:     DW_TAG_formal_parameter
                  DW_AT_location	(DW_OP_fbreg -8)
                  DW_AT_name	("this")
                  DW_AT_type	(0x00000563 "const TinyVector*")
                  DW_AT_artificial	(true)

0x0000055d:     NULL

I think I found how GDB handles this case:
https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;a=blob;f=gdb/dwarf2read.c;h=ecfae68427716333ab060ef1c44b3a25df9d7ac0;hb=HEAD#l14756

Now I have a rough implementation of the same idea.
I can probably come up with a unit test in SymbolFileDWARFTests that
checks that the function address is correct after parsing above DWARF.

Thanks. I think we're getting closer, and I think we can start talking about a more direct way to test this behavior.

This is now definitely, as Adrian called it, a DWARF peculiarity, and so an assembly file with hard-coded dwarf would be much better. Can you take a look at the files in test/Shell/SymbolFile/DWARF and create something similar. It don't think you'll need very complicated dwarf -- it should be sufficient to have two compile units with one function each, and set one of the function's DW_AT_low_pc to zero. The test command could be something like running lldb-test symbols %t -find=function -name=<foo> and checking that it returns just a single entry...

lldb/source/Plugins/SymbolFile/DWARF/SymbolFileDWARF.cpp
2361–2363

I think this check should go into the ResolveFunction call, as that function already does some validation of the function address.

And I still believe that we shouldn't special case address zero, but rather just reject any address that does not map onto a known section (it should be sufficient to replace the addr.IsValid() check in ResolveFunction with addr.IsSectionOffset()). Functions not mapping to any sections are going to be mostly useless anyway, because the first thing that lldb does when resolving an address, is to resolve the module it should search in. It uses the section list to do that, and so if the function does not belong to any section, lldb will never find it, as it will not even get around to asking the Module about it.

johannes updated this revision to Diff 234736.Dec 19 2019, 9:00 AM
  • Use addr.IsSectionOffset() as suggested.
  • Add test that links two copies of a compilation unit and makes sure that lldb only resolves it once.

The fix seems to work when linking an executable, but it does not when creating a shared object file, as demonstrated by the XFAIL test. In the DWARF of the shared object the low_pc of one copy of "foo" is set to zero, which seems to be inside a section so it is considered valid.

$ build/bin/llvm-dwarfdump build/tools/lldb/test/SymbolFile/DWARF/Output/inline-function-address-shared.test.tmp
build/tools/lldb/test/SymbolFile/DWARF/Output/inline-function-address-shared.test.tmp:       file format ELF64-x86-64

.debug_info contents:
0x00000000: Compile Unit: length = 0x0000003d version = 0x0004 abbr_offset = 0x0000 addr_size = 0x08 (next unit at 0x00000041)

0x0000000b: DW_TAG_compile_unit
              DW_AT_producer    ("")
              DW_AT_language    (DW_LANG_C99)
              DW_AT_name        ("inline-function-address.c")
              DW_AT_stmt_list   (0x00000000)
              DW_AT_low_pc      (0x0000000000001270)
              DW_AT_high_pc     (0x0000000000001271)

0x00000026:   DW_TAG_subprogram
                DW_AT_name      ("foo")
                DW_AT_decl_file ("inline-function-address.h")
                DW_AT_decl_line (12)
                DW_AT_prototyped        (true)
                DW_AT_declaration       (true)
                DW_AT_external  (true)

0x0000002d:   DW_TAG_subprogram
                DW_AT_low_pc    (0x0000000000001270)
                DW_AT_high_pc   (0x0000000000001271)
                DW_AT_frame_base        (DW_OP_reg7 RSP)
                DW_AT_specification     (0x00000026 "foo")

0x00000040:   NULL
0x00000041: Compile Unit: length = 0x0000003d version = 0x0004 abbr_offset = 0x0030 addr_size = 0x08 (next unit at 0x00000082)

0x0000004c: DW_TAG_compile_unit
              DW_AT_producer    ("")
              DW_AT_language    (DW_LANG_C99)
              DW_AT_name        ("inline-function-address.c")
              DW_AT_stmt_list   (0x0000003b)
              DW_AT_low_pc      (0x0000000000000000)
              DW_AT_high_pc     (0x0000000000000001)

0x00000067:   DW_TAG_subprogram
                DW_AT_name      ("foo")
                DW_AT_decl_file ("inline-function-address.h")
                DW_AT_decl_line (12)
                DW_AT_prototyped        (true)
                DW_AT_declaration       (true)
                DW_AT_external  (true)

0x0000006e:   DW_TAG_subprogram
                DW_AT_low_pc    (0x0000000000000000)
                DW_AT_high_pc   (0x0000000000000001)
                DW_AT_frame_base        (DW_OP_reg7 RSP)
                DW_AT_specification     (0x00000067 "foo")

0x00000081:   NULL

It is sad that we can't tell if a DW_AT_low_pc with a value of zero is valid or not. Some shared libraries might be able to have a function at address zero, so we need to be careful here. My proposed fix above will check the section that contains the low PC to see if it has execute permissions, so that should cover all cases (executables and shared libraries). Not sure if there is more we can do.

lldb/source/Plugins/SymbolFile/DWARF/SymbolFileDWARF.cpp
2284–2285

Checking if the address is section offset might not be enough. All darwin systems have a __PAGEZERO segment that covers the first 4GB for 64 bit files and the first 4096 bytes for 32 bit fiules. This section has no read, no write no execute permissions. So maybe grab the section and verify it has read + execute permissions? That way data symbol matches won't trigger.

if (auto section_sp = addr.GetSection()) {
  if (section_sp->GetPermissions() & ePermissionsExecutable) {
    sc_list.Append(sc);
    return true;
  }
}
clayborg requested changes to this revision.Dec 19 2019, 2:13 PM
This revision now requires changes to proceed.Dec 19 2019, 2:13 PM

BTW: is used to be that both DW_AT_low_pc and DW_AT_high_pc would be set to zero when a function was dead stripped. This was back when both the low and high pc used DW_FORM_addr (a file address). But then DWARF changed such that DW_AT_high_pc could be encoded as a data form: DW_FORM_data1, DW_FORM_data2, DW_FORM_data4, or DW_FORM_data8. This is used to mean it is an offset from the low PC. Seems the linkers now didn't have a relocation for the DW_AT_high_pc so they couldn't zero it out. This is sad because we can end up with many functions at address zero that didn't get linked, and if zero is a valid address, then our DWARF contains a bunch of useless info that only hides which function is the real function for address zero.

johannes updated this revision to Diff 234795.Dec 19 2019, 3:19 PM
johannes edited the summary of this revision. (Show Details)

check if the function's section is executable

johannes marked an inline comment as done.Dec 19 2019, 3:20 PM
johannes added inline comments.
lldb/source/Plugins/SymbolFile/DWARF/SymbolFileDWARF.cpp
2284–2285

Yep, that works nicely!

clayborg accepted this revision.Dec 19 2019, 3:26 PM

Looks good, thanks for tracking this down!

This revision is now accepted and ready to land.Dec 19 2019, 3:26 PM
This revision was automatically updated to reflect the committed changes.

I'm having some troubles fixing this build failure:
http://lab.llvm.org:8011/builders/lldb-x86_64-debian/builds/1866/

ninja check-lldb-api fails but when I try to execute the offending test locally with a command equivalent to the one in the buildbot log, I get below error. I've tried to rebuild lldb*.so etc without success.

$ llvm-project/lldb/test/API/dotest.py --arch=x86_64 -s b/lldb-test-traces -S nm -u CXXFLAGS -u CFLAGS --executable b/bin/lldb --compiler b/bin/clang --dsymutil b/bin/dsymutil --filecheck b/bin/FileCheck --env ARCHIVER=/usr/bin/ar --env OBJCOPY=/usr/bin/objcopy --env LLVM_LIBS_DIR=b/lib --build-dir b/lldb-test-build.noindex --lldb-module-cache-dir b/lldb-test-build.noindex/module-cache-lldb/lldb-api --clang-module-cache-dir b/lldb-test-build.noindex/module-cache-clang/lldb-api llvm-project/lldb/packages/Python/lldbsuite/test/functionalities/dead-strip -p TestDeadStrip.py
LLDB library dir: /home/builduser/build-ld/bin
LLDB import library dir: /home/builduser/build-ld/bin
lldb version 10.0.0 (git@github.com:llvm/llvm-project revision ea8a86cc8a18fd001e9670705dae3b59f6a4d974)
  clang revision 92211bf0f15ba46b5eeb88b7ea580ff539dcdd4e
  llvm revision 92211bf0f15ba46b5eeb88b7ea580ff539dcdd4e
Traceback (most recent call last):
  File "/home/builduser/build-ld/lib/python3.8/site-packages/lldb/__init__.py", line 38, in <module>
    import _lldb
ModuleNotFoundError: No module named '_lldb'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "llvm-project/lldb/test/API/dotest.py", line 7, in <module>
    lldbsuite.test.run_suite()
  File "/home/builduser/llvm-project/lldb/packages/Python/lldbsuite/test/dotest.py", line 956, in run_suite
    import lldb
  File "/home/builduser/build-ld/lib/python3.8/site-packages/lldb/__init__.py", line 41, in <module>
    from . import _lldb
ImportError: /home/builduser/build-ld/lib/python3.8/site-packages/lldb/_lldb.so: undefined symbol: _ZN5clang17ExternalASTSource2IDE

There's also another build failure on Windows, I will fix that by adding REQUIRES: linux since the tests are using ld.lld:
http://lab.llvm.org:8011/builders/lldb-x64-windows-ninja/builds/11870/

I'm having some troubles fixing this build failure:
http://lab.llvm.org:8011/builders/lldb-x86_64-debian/builds/1866/

ninja check-lldb-api fails but when I try to execute the offending test locally with a command equivalent to the one in the buildbot log, I get below error. I've tried to rebuild lldb*.so etc without success.

$ llvm-project/lldb/test/API/dotest.py --arch=x86_64 -s b/lldb-test-traces -S nm -u CXXFLAGS -u CFLAGS --executable b/bin/lldb --compiler b/bin/clang --dsymutil b/bin/dsymutil --filecheck b/bin/FileCheck --env ARCHIVER=/usr/bin/ar --env OBJCOPY=/usr/bin/objcopy --env LLVM_LIBS_DIR=b/lib --build-dir b/lldb-test-build.noindex --lldb-module-cache-dir b/lldb-test-build.noindex/module-cache-lldb/lldb-api --clang-module-cache-dir b/lldb-test-build.noindex/module-cache-clang/lldb-api llvm-project/lldb/packages/Python/lldbsuite/test/functionalities/dead-strip -p TestDeadStrip.py
LLDB library dir: /home/builduser/build-ld/bin
LLDB import library dir: /home/builduser/build-ld/bin
lldb version 10.0.0 (git@github.com:llvm/llvm-project revision ea8a86cc8a18fd001e9670705dae3b59f6a4d974)
  clang revision 92211bf0f15ba46b5eeb88b7ea580ff539dcdd4e
  llvm revision 92211bf0f15ba46b5eeb88b7ea580ff539dcdd4e
Traceback (most recent call last):
  File "/home/builduser/build-ld/lib/python3.8/site-packages/lldb/__init__.py", line 38, in <module>
    import _lldb
ModuleNotFoundError: No module named '_lldb'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "llvm-project/lldb/test/API/dotest.py", line 7, in <module>
    lldbsuite.test.run_suite()
  File "/home/builduser/llvm-project/lldb/packages/Python/lldbsuite/test/dotest.py", line 956, in run_suite
    import lldb
  File "/home/builduser/build-ld/lib/python3.8/site-packages/lldb/__init__.py", line 41, in <module>
    from . import _lldb
ImportError: /home/builduser/build-ld/lib/python3.8/site-packages/lldb/_lldb.so: undefined symbol: _ZN5clang17ExternalASTSource2IDE

I've already fixed that (the "problem" was that the test was passing now), but for future reference, the problem here appears to be that you're doing a shared library build (-DBUILD_SHARED_LIBS=ON). In that case you'll need to set LD_LIBRARY_PATH to for python to find dependent libraries correctly.

There's also another build failure on Windows, I will fix that by adding REQUIRES: linux since the tests are using ld.lld:
http://lab.llvm.org:8011/builders/lldb-x64-windows-ninja/builds/11870/

A better fix for that would be to explicitly specify the target to llc (llc -mtriple x86_64-pc-linux), as ld.lld works fine on windows too. You should add REQUIRES: x86 to catch the cases where people don't enable the X86 target, but that's better since everyone can enable the x86 target (and most do), but not everyone runs a linux machine...

BTW: is used to be that both DW_AT_low_pc and DW_AT_high_pc would be set to zero when a function was dead stripped. This was back when both the low and high pc used DW_FORM_addr (a file address). But then DWARF changed such that DW_AT_high_pc could be encoded as a data form: DW_FORM_data1, DW_FORM_data2, DW_FORM_data4, or DW_FORM_data8. This is used to mean it is an offset from the low PC. Seems the linkers now didn't have a relocation for the DW_AT_high_pc so they couldn't zero it out. This is sad because we can end up with many functions at address zero that didn't get linked, and if zero is a valid address, then our DWARF contains a bunch of useless info that only hides which function is the real function for address zero.

One solution, which we do in Sony, is to make the linker fix up undefined references to be -1 instead of 0 (at least, in the .debug_* sections). That's more obviously an invalid address. Doesn't help with existing objects in the wild but I'd like to keep that idea in the air as a forward evolutionary step.

BTW: is used to be that both DW_AT_low_pc and DW_AT_high_pc would be set to zero when a function was dead stripped. This was back when both the low and high pc used DW_FORM_addr (a file address). But then DWARF changed such that DW_AT_high_pc could be encoded as a data form: DW_FORM_data1, DW_FORM_data2, DW_FORM_data4, or DW_FORM_data8. This is used to mean it is an offset from the low PC. Seems the linkers now didn't have a relocation for the DW_AT_high_pc so they couldn't zero it out. This is sad because we can end up with many functions at address zero that didn't get linked, and if zero is a valid address, then our DWARF contains a bunch of useless info that only hides which function is the real function for address zero.

One solution, which we do in Sony, is to make the linker fix up undefined references to be -1 instead of 0 (at least, in the .debug_* sections). That's more obviously an invalid address. Doesn't help with existing objects in the wild but I'd like to keep that idea in the air as a forward evolutionary step.

I second this motion and would love to see this in more linkers.