Page MenuHomePhabricator

[LLD][ELF][DebugInfo] llvm-symbolizer shows incorrect source line info if --gc-sections used
Needs ReviewPublic

Authored by avl on Mar 19 2019, 10:34 AM.

Details

Summary
 [LLD][ELF][DebugInfo] llvm-symbolizer shows incorrect source line info if --gc-sections used for building binary.
    
 This patch is a fix for https://bugs.llvm.org/show_bug.cgi?id=41124 .
    
 When -Wl,--gc-sections is used : unreferenced .text sections could be removed,
 but debug info related to these deleted sections left unchanged.
 Address ranges for this not needed debug info could overlap with
 address ranges for correct code. In such case symbolizer shows
 line information incorrectly. 

 The real root of the problem is that irrelevant debug information left in the 
 binary. So the proper fix would be to delete that information completely. It would 
 probably be necessary to  store debug information in COMDAT sections so that 
 linker would be able to remove not only unreferenced .text section but related debug 
 info also.

That fix implements workaround for the problem. It resolves relocations related to
debug info for deleted code so that they do not clash with real code. To achieve this
it set address ranges for such unneeded debug info out of module scope.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
jhenderson added inline comments.Mar 28 2019, 6:05 AM
lld/ELF/InputSection.cpp
836

This will patch any non-alloc section that is relocated, not just .debug* sections. I think it should be scoped to those kinds of sections only, as it could break the behaviour for other unknown sections that people may have.

840

This bit doesn't quite make sense. Perhaps "We use UINT64_MAX-1 because..."

lld/test/ELF/gc-sections-debuginfo.s
2

General nit throughout this test: you are using a mixture of '-' and '--' for long switches, which is a bit jarring. Please switch to using '--' throughout, if possible.

This test should test other debug section patching too, as the problem is not specific to .debug_info. The ones I'm thinking of in particular are .debug_types, .debug_ranges, .debug_line and .debug_aranges. Since the behaviour is actually patching of any debug section, perhaps it would be better to just call the section .debug_foo or similar?

5

I don't know what the common format in LLD tests is off the top of my head, but I find it easier if the '|' is on the following line:

# RUN: echo .... \
# RUN:   | llvm-mc ...
7

Nit: spurious extra space between -sections and -section-data

12

Also applies above.

15

Replace:
"When -Wl,--gc-sections used - linker ..."
with:
"When --gc-sections is used, the linker ..."

16

sections. But their... -> sections, but their

Use "debug data" instead of "debug_info" throughout here, because .debug_info is just one section in the debug data.

18

address'es -> addresses

19

out of module -> out of the module

20

relocation from -> relocations from the

21

to deleted section .text -> to a deleted .text section
into -> to

grimar added inline comments.Mar 28 2019, 6:24 AM
lld/ELF/InputSection.cpp
836

Hmmm... I do not agree here. I think if the user's section assumes that linker resolves the relocation to a removed section as a zero, that does not seem the correct assumption to me. I doubt we should care that we switched the behavior here (or at least we probably want to know about such cases before explicitly supporting them, I think).

avl marked an inline comment as done.Mar 28 2019, 6:35 AM
avl added inline comments.
lld/ELF/InputSection.cpp
836

In some previous version of the patch there was a check for Name.startswith(".debug") which I was asked to remove because "checking the symbol name is very hacky". Would it be OK to return it here ? Or probably solution similar to https://reviews.llvm.org/D54747 would be OK ? (create a flag Debug, set it Debug = Name.startswith(".debug"), and check it here)

avl updated this revision to Diff 192656.Mar 28 2019, 9:26 AM

addressed all comments.

jhenderson added inline comments.Mar 28 2019, 9:55 AM
lld/ELF/InputSection.cpp
836

I don't feel strongly about it, so if there's opposition, what is here is probably fine, I think.

837

Nit: the -> a

lld/test/ELF/gc-sections-debuginfo.s
5

Could you indent this like you were before, please? Same below.

16

info left -> data is left

17

overlapp -> overlap

Apologies that I didn't check the information on the original bug. The GDB behaviour does look reasonable, thanks.

lld/ELF/InputSection.cpp
836

Checking the section name (not symbol name) is not *hacky* in this case. Section names that start with the prefix ".debug" are reserved "for the system". Checking that the section name starts with the prefix ".debug" is the correct ELF way to determine if it is a debug information section.

avl marked an inline comment as done.Mar 28 2019, 10:27 AM
avl added inline comments.
lld/test/ELF/gc-sections-debuginfo.s
5

I did not get it. What exactly should be done ?

this :

  1. RUN: echo '.section .text,"a"; .byte 0; .section .debug_foo,"",@progbits; .quad .text' \
  2. RUN: | llvm-mc --filetype=obj --arch=x86-64 - -o %t

should be change into this ?

  1. RUN: echo '.section .text,"a"; .byte 0; .section .debug_foo,"",@progbits; .quad .text' | \
  2. RUN: llvm-mc --filetype=obj --arch=x86-64 - -o %t
avl updated this revision to Diff 192691.Mar 28 2019, 11:34 AM

addressed all comments, except indentation one, which I did not understand yet.

grimar added inline comments.Mar 29 2019, 1:19 AM
lld/ELF/InputSection.cpp
836

But testing the symbol name inside a hot relocation loop and changing the behavior depending on that is itself a bit hacky, isn't?
My main point I think is that if there is a chance we can just do not do it - I would try it. If that will not work for some reason - then it needs a good comment and also a test case explaining why the generic approach does not work.

lld/test/ELF/gc-sections-debuginfo.s
5

I think James wanted you to add a double space (identation) before llvm-mc:

# RUN: echo '.section .text,"a"; .byte 0;  .section .debug_foo,"",@progbits; .quad .text' \
# RUN:   | llvm-mc --filetype=obj --arch=x86-64 - -o %t
jhenderson added inline comments.Mar 29 2019, 2:55 AM
lld/ELF/InputSection.cpp
837

Sorry, one more comment here, now that I read it again. Change:
"... (--gc-sections) ..." to ", e.g. due to --gc-sections,"

COMDATs are removed even without --gc-sections, for example.

lld/test/ELF/gc-sections-debuginfo.s
5

Yes, exactly this. It makes it a little easier to read this as a continuation.

avl updated this revision to Diff 192807.Mar 29 2019, 4:57 AM

added indentation and corrected sections string.

Thanks. I have no more comments. I think this would be a good thing to do, but it definitely needs some wider agreement from @ruiu and the other LLD contributors.

ruiu added inline comments.Apr 8 2019, 11:33 PM
lld/ELF/InputSection.cpp
842

Where in this document can I find a description why UINT64_MAX is special? Is using UINT64_MAX-1 as an "invalid value" just a local convention of lld (i.e. are you inventing a new notion) or a common practice? Is there any more common value representing an invalid address than UINT64_MAX-1?

avl marked an inline comment as done.Apr 8 2019, 11:41 PM
avl added inline comments.
lld/ELF/InputSection.cpp
842

It is based on the comment by @jhenderson in start of this thread. He mentioned that -1 has issues in .debug_ranges. I would find the real description of using -1 for .debug_ranges and put here...

ruiu added inline comments.Apr 8 2019, 11:45 PM
lld/ELF/InputSection.cpp
842

It's fine if -1 has a special meaning, but it doesn't necessarily mean -2 is the second best. For example, why don't you use 0? What's special about -2? Or is it just arbitrary?

avl marked an inline comment as done.Apr 8 2019, 11:53 PM
avl added inline comments.
lld/ELF/InputSection.cpp
842

We need to prevent address clashing for debug info referenced to valid sections and to deleted sections. To achieve that it is necessary to set an address which is out of module scope. So the idea is to set maximal allowed address value. UINT64_MAX is maximal value but used by .debug_aranges. UINT64_MAX -1 is the next maximal value.

address range 0...0+LENGTH could clash with existing range.
address range UINT64_MAX -1...UINT64_MAX -1+LENGTH is unlikely to be clashed.

ruiu added a comment.Apr 9 2019, 12:56 AM

I thought that address 0 is a good candidate to represent an invalid address, as nullptr is usually represented by address 0. I wonder why we want to use UINT64_MAX-2 instead of 0.

MaskRay requested changes to this revision.EditedApr 9 2019, 1:57 AM

lld sets x86_64 image base to 0x200000.
A .debug_info associated with a non-discarded section has DW_AT_low_pc > 0x200000
A .debug_info associated with a discarded section has DW_AT_low_pc = 0. If its size is larger than 0x200000 => DW_AT_high_pc > image_base => the two debug info entries overlap => llvm-symbolizer locates a DIE that's not what the user expects => unexpected filename (not_used.cpp)

If you make ".rept 2000000; nop; .endr" larger, you'll notice bfd linked executable has the same problem (its image base is 0x400000).

InputSection.cpp:833
Target->relocateOne(BufLoc, Type, SignExtend64<Bits>(Sym.getVA(Addend)));

I think the current handling in lld is quite good, no need to special case on if (!Sym.getOutputSection()). The same problem can be reproduced for ld.bfd and gold.

I think the right fix is to improve the heuristics used in llvm-symbolizer, rather than change lld. llvm-symbolizer may deal with executables linked by ld.bfd and gold.

One change we can make is that, in DWARFDebugAranges::extract, if LowPC is zero and the file is not a relocatable object, don't call appendRange(CUOffset, LowPC, HighPC);.

This revision now requires changes to proceed.Apr 9 2019, 1:57 AM

A workaround is to set a larger image base with --image-base=.

One change we can make is that, in DWARFDebugAranges::extract, if LowPC is zero and the file is not a relocatable object, don't call appendRange(CUOffset, LowPC, HighPC);.

I'm not sure if making zero DW_AT_low_pc special for executable/DSO in DWARFDebugAranges is a good idea.

avl added a comment.Apr 9 2019, 3:55 AM

I thought that address 0 is a good candidate to represent an invalid address, as nullptr is usually represented by address 0. I wonder why we want to use UINT64_MAX-2 instead of 0.

using 0 in this case could lead to following problems :

  1. on some platforms 0 is a valid address. i.e. image base starts from 0. In that case no way to differentiate valid address from invalid one just by checking for 0.
  2. if size of deleted section is long enough then debug range could clash with valid section address range. That leads to incorrect debug info selecting. As written in another message :

lld sets x86_64 image base to 0x200000.
A .debug_info associated with a non-discarded section has DW_AT_low_pc > 0x200000
A .debug_info associated with a discarded section has DW_AT_low_pc = 0. If its size is larger than 0x200000 => DW_AT_high_pc > image_base => the two debug info entries overlap => llvm->symbolizer locates a DIE that's not what the user expects => unexpected filename (not_used.cpp)

Using UINT64_MAX-1 prevents from above problems and allow to select correct DWARF entry.

in other words: using 0 as start of address range could lead to overlapping with valid address range.
using UINT64_MAX-1 as start of address range makes address range of deleted section be outside of module address space and would not clash with existing ranges.

avl added a comment.Apr 9 2019, 4:08 AM

I think the current handling in lld is quite good, no need to special case on if (!Sym.getOutputSection()). The same problem can be reproduced for ld.bfd and gold.

I think the right fix is to improve the heuristics used in llvm-symbolizer, rather than change lld. llvm-symbolizer may deal with executables linked by ld.bfd and gold.

But there is no way to heuristically detect this case which works generally. On some platforms image_base could start from 0. llvm-symbolizer just does not have enough information to decide whether address range correct or not. At the same time linker does have such information. So it looks correct to do that processing in linker. Linker knows that section deleted and able to set not matching address while resolving relocations pointed to this section.

Good general solution would be to delete not used debug info.

But for present time when that information exist - using UINT_MAX64-1 is a good way to resolve ambiquity. It is better than some heuristic which does not work for all cases.

avl added a comment.Apr 9 2019, 4:32 AM

I'm not sure if making zero DW_AT_low_pc special for executable/DSO in DWARFDebugAranges is a good idea.

That could be seen not as special value but as a value which does not clash with other ranges. For example Linker could(as an imaginary processing) assign address to the section and delete it after address assigned. In that case relocation could be resolved into this not overlapped area. Using UINT64_MAX-1 is similar to that handling. The difference is only that it simplified by taking maximal address value instead of assign an unused area.

__libc_csu_init is in libc.a(elf-init.o) and has no debug info. It should symbolize to ?? but not_used.cpp is incorrectly returned. ?? -> not_used.cpp

A more serious problem is when a main.cpp symbol incorrectly resolves to not_used.cpp. I created D60470 to fix that.

avl added a comment.Apr 9 2019, 12:31 PM

__libc_csu_init is in libc.a(elf-init.o) and has no debug info. It should symbolize to ?? but not_used.cpp is incorrectly returned. ?? -> not_used.cpp

I did not see that problem if this fix applied. Everything looks exactly like you expect:

nm res.out
0000000000201100 T __libc_csu_init

U __libc_start_main

00000000002010f0 T main

llvm-symbolizer -obj=res.o 0x0000000000201100
__libc_csu_init
??:0:0 <<<<<<<<<<<<<<<<<<<<<<<<<<<

llvm-symbolizer -obj=res.o 0x00000000002010f0
main
main.cpp:2:4 <<<<<<<<<<<<<<<<<<<<<<<<<<<

Did I correctly understand the problem which you are talking about?

A more serious problem is when a main.cpp symbol incorrectly resolves to not_used.cpp. I created D60470 to fix that.

Would solution from D60470 work correctly for platform where 0 is a valid virtual address ?

MaskRay resigned from this revision.Apr 9 2019, 8:06 PM
In D59553#1460304, @avl wrote:

__libc_csu_init is in libc.a(elf-init.o) and has no debug info. It should symbolize to ?? but not_used.cpp is incorrectly returned. ?? -> not_used.cpp

I did not see that problem if this fix applied. Everything looks exactly like you expect:

nm res.out
0000000000201100 T __libc_csu_init

U __libc_start_main

00000000002010f0 T main

llvm-symbolizer -obj=res.o 0x0000000000201100
__libc_csu_init
??:0:0 <<<<<<<<<<<<<<<<<<<<<<<<<<<

llvm-symbolizer -obj=res.o 0x00000000002010f0
main
main.cpp:2:4 <<<<<<<<<<<<<<<<<<<<<<<<<<<

Did I correctly understand the problem which you are talking about?

A more serious problem is when a main.cpp symbol incorrectly resolves to not_used.cpp. I created D60470 to fix that.

Would solution from D60470 work correctly for platform where 0 is a valid virtual address ?

See my comment at https://reviews.llvm.org/D60470#1460670.

MaskRay added inline comments.Apr 9 2019, 8:10 PM
lld/ELF/InputSection.cpp
842

I think -1 also works, no need to use -2. -1 is used somewhere in llvm/lib/DebugInfo/DWARF/ to indicate "the value is absent", exactly what we want to express here.

ruiu added a comment.Apr 9 2019, 11:39 PM
In D59553#1459554, @avl wrote:

I thought that address 0 is a good candidate to represent an invalid address, as nullptr is usually represented by address 0. I wonder why we want to use UINT64_MAX-2 instead of 0.

using 0 in this case could lead to following problems :

  1. on some platforms 0 is a valid address. i.e. image base starts from 0. In that case no way to differentiate valid address from invalid one just by checking for 0.

But this is also true to UINT64_MAX-2 because you can put your code or data there, no?

  1. if size of deleted section is long enough then debug range could clash with valid section address range. That leads to incorrect debug info selecting. As written in another message :

lld sets x86_64 image base to 0x200000.
A .debug_info associated with a non-discarded section has DW_AT_low_pc > 0x200000
A .debug_info associated with a discarded section has DW_AT_low_pc = 0. If its size is larger than 0x200000 => DW_AT_high_pc > image_base => the two debug info entries overlap => llvm->symbolizer locates a DIE that's not what the user expects => unexpected filename (not_used.cpp)

Using UINT64_MAX-1 prevents from above problems and allow to select correct DWARF entry.

in other words: using 0 as start of address range could lead to overlapping with valid address range.
using UINT64_MAX-1 as start of address range makes address range of deleted section be outside of module address space and would not clash with existing ranges.

avl added a comment.Apr 10 2019, 3:12 AM
In D59553#1459554, @avl wrote:
  1. on some platforms 0 is a valid address. i.e. image base starts from 0. In that case no way to differentiate valid address from invalid one just by checking for 0.

But this is also true to UINT64_MAX-2 because you can put your code or data there, no?

yes. that`s true. But using UINT64_MAX-1 to assign addresses looks more like an artificial case. It is natural to start assign addresses from start of address space(which could be 0). It is much less common practice(AFAIK) to assign addresses in the very end of address space.

In the very first version of that path there was a code which analyzes sections for assigned addresses and select an address which does not clash with already assigned module`s ranges. It was suggested to simplify this processing to use UINT64_MAX-1 instead of analyzing address ranges. Assuming that it is really rare case when UINT64_MAX-1 would clash with existing module`s ranges. So using UINT64_MAX-1 is "simple, fast, good enough" solution which works better than 0. i.e. there would be much less cases when addresses would clash for UINT64_MAX-1 than for 0.

avl marked an inline comment as done.Apr 10 2019, 3:20 AM
avl added inline comments.
lld/ELF/InputSection.cpp
842

-1 used as a special value indicating "base address selection entry" in debug range list processing. Since we do not need to indicate "base address selection entry" here then it would probably be better to use -2.

avl added a comment.Apr 12 2019, 12:46 AM

@MaskRay Hi, Since you changed status of review into "Needs Revision" - What changes do you think should be done ?

avl requested review of this revision.Apr 15 2019, 1:56 PM
MaskRay added a comment.EditedApr 15 2019, 10:23 PM
In D59553#1463805, @avl wrote:

@MaskRay Hi, Since you changed status of review into "Needs Revision" - What changes do you think should be done ?

My thoughts about this change are split. I wanted to reset the status of review into "Normal" but I didn't how to do that :( I didn't want to change it to "Accepted" so I chose "Resign". It seems it didn't restore the status back to normal and only you can do that...

What I don't like

  • complexity. Two lines are two lines.
  • The demonstration case is a bit contrived. Filling in a garbage collected text section with zeros/nops to overlap with another text section doesn't seem common. The problems aren't new but lld's layout mitigates many(most?) of the problems. See https://reviews.llvm.org/D60470#1462246
  • Existing tools have learned various heuristics to avoid 0 DW_AT_low_pc collision, but I am not sure what they'd do if those relocations resolve to -1. llvm-symbolizer is certainly good (I have checked a few examples - I'm learning it recently), but what about other tools?

What I like

  • -1 or -2 makes the resolved relocation outstanding -> a different value from 0 makes it more obvious. Avoiding -1 just because "base address selection entry" uses -1 is not a strong argument IMHO. As long as there ranges are invalid (or valid but unrealistic to collide with good ones) we are good.
MaskRay added a reviewer: MaskRay.
avl added a comment.Apr 15 2019, 10:49 PM

@ruiu What do you think about using UINT64_MAX-1 as address to which relocations for deleted sections would be resolved ? Does it make sense ?

avl added a comment.Apr 15 2019, 11:25 PM

Existing tools have learned various heuristics to avoid 0 DW_AT_low_pc collision, but I am not sure what they'd do if those relocations resolve to -1. llvm-symbolizer is certainly good (I have checked a few >examples - I'm learning it recently), but what about other tools?

addr2line started to behave correctly :

[LD]clang++ -gdwarf-5 -O -Wl,--gc-sections -fuse-ld=ld not_used.o main.o -o res.out
addr2line -e res.out 0x400497
/home/avl/bugs/gc_debuginfo/not_used.cpp:2 <<<<<<<<<<<<<<<<<<<<<<<<<< incorrect if built by ld

[LLD]clang++ -gdwarf-5 -O -Wl,--gc-sections -fuse-ld=lld not_used.o main.o -o res.out
addr2line -e res.out 0x2010e7
/home/avl/bugs/gc_debuginfo/main.cpp:3 <<<<<<<<<<<<<<<<<<<<<<< correct if built by lld with this patch

avl added a comment.Apr 22 2019, 7:21 AM

I find a previous discussion at https://bugs.llvm.org/show_bug.cgi?id=37212 and https://reviews.llvm.org/D46502

this is another case(other than -ffunction-sections) where using UINT64_MAX-1 as address value for debug info related to discarded sections would help to dump debug info more correct.

In D59553#1482151, @avl wrote:

ping

Can I have your answers to my questions at https://reviews.llvm.org/D59553#1467876? I have said the example is contrived, and the zero DW_TAG_low_pc issue causes more problems to ld.bfd/gold linked modules than to lld linked modules.

avl added a comment.Apr 29 2019, 10:56 AM

Excuse me, I did not realize that those were questions. Please check my answers :

complexity. Two lines are two lines.
That is what I do not understand. Why two-lines patch considered to be complex?
The demonstration case is a bit contrived. Filling in a garbage collected text section with zeros/nops to overlap with another text section doesn't seem common. The problems aren't new but lld's layout mitigates many(most?) of the problems. See https://reviews.llvm.org/D60470#1462246
I do not think that the demonstration case is a contrived. The problem probably happened  less frequently for common x86 case when image-base starts from big values. But there are other platforms/compilation languages/linking options/linkers where this problem happens more often. The test case shows the problem. I do not think it would be better if the test case would be bigger and more complex to show some realistic common x86 case.

I also do not think that zero DW_TAG_low_pc issue causes more problems to ld.bfd/gold linked modules than to lld linked modules. There many combinations of platforms/compilation languages/linking options/linkers  which could lead into the described problem. In various combinations zero DW_TAG_low_pc issue would be bigger problem for one linker and smaller problem for another(because of different linker behavior).
Existing tools have learned various heuristics to avoid 0 DW_AT_low_pc collision, but I am not sure what they'd do if those relocations resolve to -1. llvm-symbolizer is certainly good (I have checked a few examples - I'm learning it recently), but what about other tools?

addr2line started to behave correctly :

[LD]clang++ -gdwarf-5 -O -Wl,--gc-sections -fuse-ld=ld not_used.o main.o -o res.out
addr2line -e res.out 0x400497
/home/avl/bugs/gc_debuginfo/not_used.cpp:2 <<<<<<<<<<<<<<<<<<<<<<<<<< incorrect if built by ld

[LLD]clang++ -gdwarf-5 -O -Wl,--gc-sections -fuse-ld=lld not_used.o main.o -o res.out
addr2line -e res.out 0x2010e7
/home/avl/bugs/gc_debuginfo/main.cpp:3 <<<<<<<<<<<<<<<<<<<<<<< correct if built by lld with this patch

MaskRay added a comment.EditedMay 6 2019, 12:26 AM

I've some previous discussions PR37212 PR37891 etc. -2 or -1 may be benign to some non-allocated sections like .debug.*, but may not be good to others. The consumers may get used to treat 0 DW_AT_low_pc special but the patch may make some consumers code more special values. lld doesn't implement the CB_PRETEND logic in GNU linkers:

Maybe you can raise an issue on https://sourceware.org/bugzilla , asking if the relocation can be resolved to -2 if the PRETEND logic fails, or asking if addr2line should be fixed instead. I raised some issues in the weekend and the maintainers seem super responsive.

I think it is quite possible that we just need the llvm-symbolizer fix D60470 and an addr2line fix to make everything work out. Debuggers have had the heuristics for years.

But there are other platforms/compilation languages/linking options/linkers where this problem happens more often.

If you happen to know one combination, please elaborate :)

I also do not think that zero DW_TAG_low_pc issue causes more problems to ld.bfd/gold linked modules than to lld linked modules.

The problem probably happened less frequently for common x86 case when image-base starts from big values.

The image base doesn't matter (is zero) for -shared/-pie. I've shared my thoughts why this doesn't likely cause problems to lld. In lld, the PF_R PT_LOAD is placed before the PF_R|PF_X PT_LOAD (text segment). If you have a compilation unit that is sufficiently large (large DW_AT_high_pc, 0 DW_AT_low_pc), it is reasonable to have a sufficiently large PF_R PT_LOAD. Then the range will not collide with a good compilation unit with large DW_AT_low_pc.

On the GNU linkers' side, ld.bfd -z separate-code is a recent addition and it doesn't move the large .rodata etc before .text. gold hasn't implemented -z separate-code. They are more vulnerable to the issue.

avl added a comment.May 6 2019, 9:29 AM

I've some previous discussions PR37212 PR37891 etc. -2 or -1 may be benign to some non-allocated sections like .debug.*, but may not be good to others.

Could you explicitly describe that case, please ? i.e. When using UINT64_MAX-1 as start of address range for relocations pointing to deleted code would not be good for others ?

The consumers may get used to treat 0 DW_AT_low_pc special but the patch may make some consumers code more special values.

that would be a error. Treating 0 DW_AT_low_pc as a special value would not be correct because there is no such documented special value. Consumers should not rely that linker resolves the relocation to a removed section as a zero.

this patch does not add additional special value. consumer code would not be required to add more special values.

lld doesn't implement the CB_PRETEND logic in GNU linkers:

gold CB_PRETEND https://sourceware.org/git/?p=binutils-gdb.git;a=blob;hb=9d6d4be89d12747a92629ed1bde1d423e2831de1;f=gold/target-reloc.h#l414
ld.bfd action_discarded PRETEND https://sourceware.org/git/?p=binutils-gdb.git;a=blob;hb=9d6d4be89d12747a92629ed1bde1d423e2831de1;f=bfd/elflink.c#l10851

Maybe you can raise an issue on https://sourceware.org/bugzilla , asking if the relocation can be resolved to -2 if the PRETEND logic fails, or asking if addr2line should be fixed instead. I raised some issues in the weekend and the >maintainers seem super responsive.

PRETEND case is a related but separate issue. PRETEND is about "assign address range of selected COMDAT section to the references to deleted COMDAT section".
This patch is about - "assign not-clashing address value to the references for the deleted code".
I can raise an issue there but I do not think this discussion should depend on https://sourceware.org/bugzilla.

I think it is quite possible that we just need the llvm-symbolizer fix D60470 and an addr2line fix to make everything work out. Debuggers have had the heuristics for years.

I do not agree with this.
llvm-symbolizer and addr2line just do not have enough information to properly select correct address range.

for the platform where image-base starts from 0 - D60470 will not help to select correct address range. Right ?

I think there are three levels of solutions for this problem :

  1. Best one: Delete wrong debug info. In this case the problem just would not arise.
  2. Good enough: Set not clashing address range for code pointing to deleted sections while resolving relocations in linker. linker knows which address range is correct. So it could unambiguously assign proper value.
  3. Less powerful: implement heuristics in llvm-symbolizer and addr2line. heuristics do not work for all cases. Thus wrong behavior would still exists.

But there are other platforms/compilation languages/linking options/linkers where this problem happens more often.

If you happen to know one combination, please elaborate :)

sure.

  1. x86 platform and -image-base=0 makes this problem to happen much more often.
  2. there are quite many platforms which already has -image-base starting from zero. This is often used for embedded systems. as an example - please check message from @jhenderson https://reviews.llvm.org/D60470#1462375 . In such case this problem happens much more often.
  1. there are programming languages not using exceptions intensively. So they do not create eh_frame section at all or have it very small. This would lead to not shift start of text section much (as in your explanation why the problem is less important for lld).

The image base doesn't matter (is zero) for -shared/-pie. I've shared my thoughts why this doesn't likely cause problems to lld.
In lld, the PF_R PT_LOAD is placed before the PF_R|PF_X PT_LOAD (text segment). If you have a compilation unit that is sufficiently
large (large DW_AT_high_pc, 0 DW_AT_low_pc), it is reasonable to have a sufficiently large PF_R PT_LOAD.
Then the range will not collide with a good compilation unit with large DW_AT_low_pc.

On the GNU linkers' side, ld.bfd -z separate-code is a recent addition and it doesn't move the large .rodata etc before .text.
gold hasn't implemented -z separate-code. They are more vulnerable to the issue.

I do not really understand why we are discussing that question?
Even if that problem is less important for lld than for others linkers - it still exists.
So we need to patch lld anyway.

Nevertheless above description based on the fact that :

" If you have a compilation unit that is sufficiently
large (large DW_AT_high_pc, 0 DW_AT_low_pc), it is reasonable to have a sufficiently large PF_R PT_LOAD.
Then the range will not collide with a good compilation unit with large DW_AT_low_pc."

taking into account that :

ld sets default image base to 0x400000 for x86_64.
gold default image base is 0x400000 for x86_64.
lld default image base is 0x200000 for x86_64

  1. size of sections put before .text section by lld should be greater than 0x200000 to have a benefit comparing with other linkers. In case size of sections would be less than 0x200000 then lld would have a problem but other linkers would not.
  1. Actually it is not correct to link size of .text section and size of previously put sections as in above assumption. Size of text sections should be compared with size of deleted code. if there were deleted big piece of data then broken range could be big - 0x0-0x300000 f.e; But correct code could be smaller(size of text section is 1000 f.e) - in this case there would be overlapping address ranges in-spite of the fact that lld put some sections before .text section.

Could you explicitly describe that case, please ? i.e. When using UINT64_MAX-1 as start of address range for relocations pointing to deleted code would not be good for others ?

You are the one who propose to change the status quo and use UINT64_MAX-1. The burden of proof is on your side. You have the obligation to collect evidence from approaches taken by various DWARF consumers to prove the newly introduced special value wouldn't cause problems. However, I'd be happy to dig up history of various debuggers to learn what they do when I am free.

I also want to make it clear: your code changes the general behavior that applies to any non-allocated section, not something as specific as how a DW_AT_low_pc in a .debug_info is resolved. .rela.debug_* is a case of relocations into non-allocated sections but it isn't the only case. Do we have other such sections that may be thwarted (say -2 might be harmful to them) by your patch?

llvm-symbolizer and addr2line just do not have enough information to properly select correct address range.

If other DWARF consumers (debuggers, dtrace, profilers, etc) are happy with 0 DW_AT_low_pc, the problem is probably on the symbolizers' side. And I have sufficient argument that addr2line doesn't do well and I am about to file a bug.

taking into account that :

ld sets default image base to 0x400000 for x86_64.
gold default image base is 0x400000 for x86_64.
lld default image base is 0x200000 for x86_64

  1. size of sections put before .text section by lld should be greater than 0x200000 to have a benefit comparing with other linkers. In case size of sections would be less than 0x200000 then lld would have a problem but other linkers would not.
  2. Actually it is not correct to link size of .text section and size of previously put sections as in above assumption. Size of text sections should be compared with size of deleted code. if there were deleted big piece of data then broken range could be big - 0x0-0x300000 f.e; But correct code could be smaller(size of text section is 1000 f.e) - in this case there would be overlapping address ranges in-spite of the fact that lld put some sections before .text section.

You arguments here make me think we are not on the same page. The image base is used only by non-PIE executable, neither -pie nor -shared. The .debug_info input section associated with a discarded text section should have its DW_AT_high_pc larger than the DW_AT_low_pc of a good compilation unit to cause collision, not less than 0x200000 as you said.

size of previously put sections

Actually they matter a lot. lld places the PF_R PT_LOAD segment before PF_R|PF_X PT_LOAD (text segment). The text segment contains compilation units. If the PF_R is sufficiently large, it will make the relative address of the text segment larger, thus unlikely to collide.

avl added a comment.May 7 2019, 1:37 AM

You are the one who propose to change the status quo and use UINT64_MAX-1.
The burden of proof is on your side. You have the obligation to collect evidence from
approaches taken by various DWARF consumers to prove the newly introduced special value wouldn't cause problems

Sure. I though you have in mind something concrete.

I checked behavior of addr2line and gdb -they are working correctly with this patch.

Could you tell which check would be good from your opinion to do more, please ?

I also want to make it clear: your code changes the general behavior that applies to any non-allocated section,
not something as specific as how a DW_AT_low_pc in a .debug_info is resolved. .rela.debug_*
is a case of relocations into non-allocated sections but it isn't the only case.
Do we have other such sections that may be thwarted (say -2 might be harmful to them) by your patch?

Originally that patch was limited to only .debug* sections. But it was requested to extend it to all non-allocated section.
https://reviews.llvm.org/D59553#1445935
Since there is no requirement that certain sections should be assigned to 0 then I agree it should not be a problem to assign any other suitable value for them.

If other DWARF consumers (debuggers, dtrace, profilers, etc) are happy with 0 DW_AT_low_pc,
the problem is probably on the symbolizers' side. And I have sufficient argument that addr2line doesn't do well and I am about to file a bug.

actually other DWARF consumers are not happy. They implemented various heuristics.
Instead of implementing various, potentially not matching, heuristics it is possible to just avoid address ranges clash.

Could you confirm that solution suggested in D60470 could not handle following situation correctly ?

One piece of debug info references removed address range DW_AT_low_pc=0x0 DW_AT_high_pc=0x10000
Another piece of debug info references correct address range DW_AT_low_pc=0x0 DW_AT_high_pc=0x10000

You arguments here make me think we are not on the same page. The image base is used only by non-PIE executable,
neither -pie nor -shared. The .debug_info input section associated with a discarded text section
should have its DW_AT_high_pc larger than the DW_AT_low_pc of a good compilation unit to cause collision, not less than 0x200000 as you said.

One piece of debug info references removed address range DW_AT_low_pc=0x0 DW_AT_high_pc=0x350000
Another piece of debug info references correct address range DW_AT_low_pc=0x300000 DW_AT_high_pc=0x390000

in this case DW_AT_high_pc of discarded text section larger than the DW_AT_low_pc of a good compilation unit.
lld put tables which size is 0x100000 before .text section so DW_AT_low_pc=0x300000.
lld generated executable would have a problem. ld and gold generated would not.

Actually they matter a lot. lld places the PF_R PT_LOAD segment before PF_R|PF_X PT_LOAD (text segment).
The text segment contains compilation units. If the PF_R is sufficiently large, it will make the relative address of the text segment larger, thus unlikely to collide.

they matter if they are significantly large - otherwise they are not.

One piece of debug info references removed address range DW_AT_low_pc=0x0 DW_AT_high_pc=0x280000
Another piece of debug info references correct address range DW_AT_low_pc=0x202000 DW_AT_high_pc=204000

in this case DW_AT_high_pc of discarded text section larger than the DW_AT_low_pc of a good compilation unit.
lld put tables which size is 0x2000 before .text section so DW_AT_low_pc=202000.
lld generated executable would have a problem. ld and gold generated would not.

actually other DWARF consumers are not happy.

Example please.

I checked behavior of addr2line and gdb -they are working correctly with this patch.

Can you check if gdb works without this patch?

Could you confirm that solution suggested in D60470 could not handle following situation correctly ?

Another piece of debug info references correct address range DW_AT_low_pc=0x0 DW_AT_high_pc=0x10000

D60470 doesn't specialize case zero address, it changes priority. I suggest you actually take a look at that patch. The behavior of ld.bfd/gold will unlikely to change. D60470 will also help them.

I want to see a concrete example where 0 is used as a valid DW_AT_low_pc.

I can change vm.mmap_min_addr to 0 on my Linux and mmap a fixed page at 0 freely, but that doesn't mean 0 can be a valid DW_AT_low_pc. The 0 page can be mapped but more likely used as file header, irq, etc but unlikely to be used as code.

lld generated executable would have a problem. ld and gold generated would not.

You may check how the R and RW PT_LOAD segments are laid out in ld.bfd and gold linked modules.
You were also wrong with your understanding of image base (it was irrelevant in -shared/-pie) in your previous comments.
I just feel my explanation about how lld's layout largely mitigates the issue is kept being ignored.

I want to see a concrete example where 0 is used as a valid DW_AT_low_pc.

I can change vm.mmap_min_addr to 0 on my Linux and mmap a fixed page at 0 freely, but that doesn't mean 0 can be a valid DW_AT_low_pc. The 0 page can be mapped but more likely used as file header, irq, etc but unlikely to be used as code.

In Sony executables, the file header is not stored in loadable memory. Our first PT_LOAD has address 0 and sections start from that address. Technically, we don't have any code at that address yet, but we are planning on removing the limitation that stops us from doing so, in which case we will have code at address 0. As mentioned earlier, we have solved the problem for now by using a special value in DW_AT_low_pc, and other techniques, but these all require private patches, so an alternative would be nice.

I do agree that this patching should be limited to debug sections though. Other consumer sections might have other rules regarding special values etc, and be working on the assumption of patching to 0 is the best plan.

avl added a comment.May 17 2019, 9:24 AM

@MaskRay

actually other DWARF consumers are not happy.

Example please.

please check following behavior of lldb, llvm-symbolizer, gnu addr2line, gnu objdump :

$ cat main.cpp
void foo_used();

int main(void) {

foo_used();
return 0;

}
$ cat funcs.cpp
void foo_not_used () {

__asm__(".rept 2105344; nop; .endr");

}

void foo_used () {

__asm__(".rept 10000; nop; .endr");

}

$ clang++ -gdwarf-4 -O funcs.cpp -ffunction-sections -c
$ clang++ -gdwarf-4 -O main.cpp -ffunction-sections -c
$ clang++ -gdwarf-4 -O funcs.o main.o -fuse-ld=lld -Wl,--gc-sections -o res.out
$ lldb res.out
(lldb) disassemble -name main
res.out`main:
res.out[0x203810] <+0>: pushq %rax
res.out[0x203811] <+1>: callq 0x2010f0 ; foo_not_used + 240
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
res.out[0x203816] <+6>: xorl %eax, %eax
res.out[0x203818] <+8>: popq %rcx
res.out[0x203819] <+9>: retq

(lldb) b foo_used
Breakpoint 1: where = res.out`foo_not_used() + 240, address = 0x00000000002010f0
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

$ llvm-symbolizer -obj=res.out 0x00000000002010f0
foo_used()
/home/avl/bugs/gc_debuginfo/funcs.cpp:2:5
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

$ addr2line -e res.out 0x00000000002010f0
/home/avl/bugs/gc_debuginfo/funcs.cpp:2
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

$ objdump -d -S res.out
0000000000201000 <_start>:
void foo_not_used () {

__asm__(".rept 2105344; nop; .endr");

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

201000:       31 ed                   xor    %ebp,%ebp
201002:       49 89 d1                mov    %rdx,%r9
201005:       5e                      pop    %rsi

And this patch makes them happy:

$ lldb res.out
(lldb) disassemble -name main
res.out`main:
res.out[0x203810] <+0>: pushq %rax
res.out[0x203811] <+1>: callq 0x2010f0 ; foo_used at funcs.cpp:6:5
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
res.out[0x203816] <+6>: xorl %eax, %eax
res.out[0x203818] <+8>: popq %rcx
res.out[0x203819] <+9>: retq

(lldb) b foo_used
Breakpoint 1: where = res.out`foo_used() at funcs.cpp:6:5, address = 0x00000000002010f0
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

$ llvm-symbolizer -obj=res.out 0x00000000002010f0
foo_used()
/home/avl/bugs/gc_debuginfo/funcs.cpp:6:5
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

$ addr2line -e res.out 0x00000000002010f0
/home/avl/bugs/gc_debuginfo/funcs.cpp:6
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

$ objdump -d -S res.out
0000000000201000 <_start>:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

201000:       31 ed                   xor    %ebp,%ebp
201002:       49 89 d1                mov    %rdx,%r9
201005:       5e                      pop    %rsi
201006:       48 89 e2                mov    %rsp,%rdx
201009:       48 83 e4 f0             and    $0xfffffffffffffff0,%rsp

I checked behavior of addr2line and gdb -they are working correctly with this patch.

Can you check if gdb works without this patch?

gdb works correctly without this patch for cases when zero virtual address is not a correct value for code.
Because it uses 0 as special value indicating that address range is incorrect.
It would not work correctly for the platform where 0 is a valid vma for code.
It is generally wrong to use 0 as a special value.

Using 0 as the beginning of the section with code is not forbidden.

I want to see a concrete example where 0 is used as a valid DW_AT_low_pc.

  1. according to @jhenderson Sony has/would have in future such use case.
  2. according to this http://lists.llvm.org/pipermail/lldb-dev/2017-March/012091.html Qualcomm Kalimba DSP and the XAP RISC CPU have this use case. they even use similar to this patch approach to solve the problem.
  3. Such use case often used for embedded systems.

I personally do not have access to such systems to demonstrate a use case for you.

lld generated executable would have a problem. ld and gold generated would not.

You may check how the R and RW PT_LOAD segments are laid out in ld.bfd and gold linked modules.

For the above example :

$ clang++ -gdwarf-4 -O funcs.o main.o -fuse-ld=lld -Wl,--gc-sections -o res.out
$ addr2line -e res.out 0x00000000002010f0
/home/avl/bugs/gc_debuginfo/funcs.cpp:2
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

$ clang++ -gdwarf-4 -O funcs.o main.o -fuse-ld=ld -Wl,--gc-sections -o res.out
$ addr2line -e res.out 0x00000000004004a0
/home/avl/bugs/gc_debuginfo/funcs.cpp:6

$ clang++ -gdwarf-4 -O funcs.o main.o -fuse-ld=gold -Wl,--gc-sections -o res.out
$ addr2line -e res.out 0x0000000000400510
/home/avl/bugs/gc_debuginfo/funcs.cpp:6

lld generated binary has problem, ld and gold do not have.

avl added a comment.May 17 2019, 9:28 AM

@jhenderson

I do agree that this patching should be limited to debug sections though.
Other consumer sections might have other rules regarding special values etc, and be working on the assumption of patching to 0 is the best plan.

OK, I would change a patch accordingly and update shortly.

MaskRay added a subscriber: labath.May 18 2019, 8:43 PM

I don't see a problem with changing lldb's behavior to add special treatment for address zero. It's not the most ideal solution (that would be to avoid generating the debug info in the first place), but given that we have to accept inputs from other linkers too (which also implement this behavior), it's probably the best thing we can do now.

avl added a comment.May 20 2019, 3:01 AM

@labath

In general it is not possible to handle zero case properly in lldb. It does not have information which address range is correct.
if platform allows section to contain code starting from zero then it is not possible to detect good and bad address range:

One piece of debug info references removed address range DW_AT_low_pc=0x0 DW_AT_high_pc=0x10000
Another piece of debug info references correct address range DW_AT_low_pc=0x0 DW_AT_high_pc=0x10000

which address range should be selected ?

Solution from this patch prevents such situations. lld would not generate clashed correct address ranges at all.

Do you think it is bad to have a solution which allows to not to have clashed address ranges ?

lldb still is able to implement some heuristics for code generated by ld/gold.(which would also not work for zero-address platform)
But binaries generated by lld would be properly handled by all DWARF consumers without any change in them.
Because there would not be clashed addresses.

avl updated this revision to Diff 200318.May 20 2019, 9:47 AM

limit resolving relocations to UINT64_MAX-1 value for only debug sections, as requested.

I'm personally happy with the implementation now, but obviously there needs to be agreement from the other reviewers.

avl updated this revision to Diff 201495.Mon, May 27, 4:05 AM

deleted redundant tests.

avl added a comment.Sun, Jun 2, 11:58 PM

ping. @ruiu Do you think it would be useful to integrate this patch ?

this patch already has Looks Good from @jhenderson and @grimar .

It helps lldb, llvm-symbolizer, gnu addr2line, gnu objdump to show debug info properly in case clashed address ranges.

ruiu added a comment.Mon, Jun 3, 12:01 AM

I'm fine with this patch. We've discussed a lot about what is a desirable value for a "null" for debug info, but looks like as long as it does not appear as a valid address, it should be fine.

Is Eric fine with this?

avl added a comment.Tue, Jun 4, 7:28 AM

@echristo What is your opinion on this patch ? Is it worth to integrate it ?

avl added a comment.Fri, Jun 14, 2:09 PM

@echristo Hi Eric, Could check this revision, please ? Is that patch Ok for you ?