Page MenuHomePhabricator

[ELF] Resolve relocations in .debug_* referencing (discarded symbols or ICF folded section symbols) to tombstone values
ClosedPublic

Authored by MaskRay on Sat, Jun 13, 12:30 AM.

Details

Summary

See D59553, https://lists.llvm.org/pipermail/llvm-dev/2020-May/141885.html and
https://sourceware.org/pipermail/binutils/2020-May/111357.html for
extensive discussions on a tombstone value.
See http://www.dwarfstd.org/ShowIssue.php?issue=200609.1
(Reserve an address value for "not present") for a DWARF enhancement proposal.

We resolve such relocations to a tombstone value to indicate that the address is invalid.
This solves several problems (the normal behavior is to resolve the relocation to the addend):

  • For an empty function in a collected section, a pair of (0,0) can terminate .debug_loc and .debug_ranges (as of binutils 2.34, GNU ld resolves such a relocation to 1 to avoid the .debug_ranges issue)
  • If DW_AT_high_pc is sufficiently large, the address range can collide with a regular code range of low address (https://bugs.llvm.org/show_bug.cgi?id=41124 )
  • If a text section is folded into another by ICF, we may leave entries in multiple CUs claiming ownership of the same range of code, which can confuse consumers.
  • Debug information associated with COMDAT sections can have problems similar to ICF, but is more complex - thus not addressed by this patch.

For pre-DWARF-v5 .debug_loc and .debug_ranges, a pair of 0 can terminate
entries (invalidating subsequent ranges).
-1 is a reserved value with special meaning (base address selection entry) which can't be used either.
Use -2 instead.

For all other .debug_*, use UINT32_MAX for 32-bit targets and UINT64_MAX
for 64-bit targets. In the code, we intentionally use
uint64_t tombstone = UINT64_MAX for 32-bit targets as well: this matches
SignExtend64 as used in relocateAlloc. (Actually UINT32_MAX does not work for R_386_32)

Note 0, we only special case target->symbolicRel (R_X86_64_64, R_AARCH64_ABS64, R_PPC64_ADDR64), not
short-range absolute relocations (e.g. R_X86_64_32). Only forms like DW_FORM_addr need to be special cased.
They can hold an arbitrary address (must be 64-bit on a 64-bit target). (In theory,
producers can make use of small code model to emit 32-bit relocations. This doesn't seem to be leveraged.)

Note 1, we have to ignore the addend, because we don't want to resolve
DW_AT_low_pc (which may have a non-zero addend) to -1+addend (wrap
around to a low address):

__attribute__((section(".text.x"))) void f1() { }
__attribute__((section(".text.x"))) void f2() { } // DW_AT_low_pc has a non-zero addend

Note 2, if the prevailing copy does not have debugging information while
a non-prevailing copy has (partial debug build), we don't do extra work
to attach debugging information to the prevailing definition. (clang
has a lot of debug info optimizations that are on-by-default that assume
the whole program is built with debug info).

clang -c -ffunction-sections a.cc    # prevailing copy has no debug info
clang -c -ffunction-sections -g b.cc

Diff Detail

Event Timeline

MaskRay created this revision.Sat, Jun 13, 12:30 AM
MaskRay edited the summary of this revision. (Show Details)Sat, Jun 13, 12:32 AM
MaskRay edited the summary of this revision. (Show Details)Sat, Jun 13, 12:39 AM
MaskRay updated this revision to Diff 270567.Sat, Jun 13, 12:42 AM

Update comment

I only partially followed the discussion on llvm-dev. The resolution sounds sensible to me. Will be good to get someone involved in the upstream discussion to check it over.

Not related to this patch:
FWIW Arm's proprietary linker has a relatively simple garbage collection algorithm for debug sections. Debug sections were partitioned into two sets, the first set contained relocations to non-debug sections, the second set did not but were reachable via relocations from the first set. After gc of non-debug sections, the members of the first set were marked live if any of the non-debug sections they referenced were marked live. We'd then traverse the relocations from the live members of set 1 to find the live members of set 2. This worked well for Arm's proprietary compiler as it put quite a lot of effort into generating separating out debug sections per functions. It wasn't perfect and there were .debug_info sections that referenced multiple independent .text sections, but did permit quite a lot of debug to be removed without having to parse the dwarf.

jhenderson added inline comments.Wed, Jun 17, 12:52 AM
lld/ELF/InputSection.cpp
930

Nit: spurious space before the full stop.

lld/test/ELF/debug-dead-reloc-32.s
3

tombstone -> tombstone value

Here and in other tests.

18

Is the 'e' for SHF_EXCLUDE? If so, do we actually want it and --gc-sections? Seems like one or the other would be sufficient to me.

lld/test/ELF/debug-dead-reloc-icf.s
3–6

One issue that I don't recall whether it was brought up previously: if we resolve one to -1, and the other has no debug data (e.g. they came from two different CUs, only one of which was compiled with -g), we could actually lose debug data for the function completely.

I don't know what the right thing to do is in this situation, as I don't know whether it's better than the risk of confusing consumers. I don't think we can reasonably attempt to identify whether there is matching debug data for the kept version after all.

13

Perhaps it would be better to spell out all the zeros for the first relocation target, as it was a bit confusing seeing one set of 8 followed by a set of 16.

dblaikie added inline comments.Wed, Jun 17, 10:02 AM
lld/test/ELF/debug-dead-reloc-icf.s
3–6

Yeah, I'd accept that as an existing problem/not one that I think this tombstone work should necessarily address - the old solution or the new solution would be compatible with a linker making a more nuanced approach to this problem.

(but, in that separate discussion: I'm not sure this would be appropriate to fix - fixing this, I think, would mean having debug info influence which version of a function the linker chooses - increasing the risk of heisenbugs)

MaskRay updated this revision to Diff 271438.Wed, Jun 17, 12:07 PM
MaskRay marked 6 inline comments as done.
MaskRay edited the summary of this revision. (Show Details)

Address comments.

Add Note 2 to the description.

lld/ELF/InputSection.cpp
930

I tend to not write .debug_ranges. I will add sections

lld/test/ELF/debug-dead-reloc-32.s
18

I'll drop --gc-sections.

lld/test/ELF/debug-dead-reloc-icf.s
3–6

https://reviews.llvm.org/D59553#2083857

My understanding is that: the prevailing definition should have associated debug info. If not, the linker should not do extra work to attach debug info from other copies to the prevailing definition.

MaskRay updated this revision to Diff 271439.Wed, Jun 17, 12:10 PM

Fix debug-dead-alloc*.s

jhenderson accepted this revision.Thu, Jun 18, 12:41 AM

LGTM, but I'd want @avl to be happy with it, since he did the original work essentially.

lld/test/ELF/debug-dead-reloc-icf.s
3–6

FWIW, I looked at our properietary linker and downstream LLD ports, both of which have the -1 patching behaviour implemented, and we don't handle the "some with, some without" case at all, so ICF folded symbols can theoretically lose debug data, same as is implemented in this patch. So from our point of view, there's no regression, just a continued space from improvement.

This revision is now accepted and ready to land.Thu, Jun 18, 12:41 AM
avl added a comment.Thu, Jun 18, 4:13 AM

I am happy with it. There is a small comment inside. other than that lgtm.

lld/ELF/InputSection.cpp
932

I think it is better not to do string comparisons for every relocation pointing to deleted code.
It is worth to move this name comparisons before the loop.

avl accepted this revision.Thu, Jun 18, 4:14 AM
MaskRay marked 5 inline comments as done.Thu, Jun 18, 12:16 PM
MaskRay added inline comments.
lld/ELF/InputSection.cpp
932

I think it is actually more efficient doing string comparisons here. The tombstone value code path is rare. By doing string comparisons here, we don't perform string comparisons in the normal code paths.

dblaikie added inline comments.Thu, Jun 18, 12:56 PM
lld/ELF/InputSection.cpp
932

But you have to do it repeatedly for every relocation in this section, yes? - and .debug_loc and .debug_ranges may contain many relocations.

The test for "is this a debug section" happens unconditionally once, outside the relocation application loop - it seems like this test could/should happen in the same place (& it can be conditional on "is this a debug section":

`
bool isDebugLocOrRanges = isDebugLoc && name == ".debug_loc" || name == ".debug_ranges";
`

Seems like it'd be cheap enough (known short length, the string's already hot in the cache, etc). (alternatively the test could be computed lazily and cached - but given "isDebug" isn't done that way I don't think this one should be either)

MaskRay updated this revision to Diff 271826.Thu, Jun 18, 1:20 PM
MaskRay marked an inline comment as done.

Hoist isDebugLocOrRanges

MaskRay marked an inline comment as done.Thu, Jun 18, 1:21 PM
dblaikie accepted this revision.Thu, Jun 18, 5:13 PM

Some optional code layout and comments - but all good either way/whatever you find most readable/helpful.

lld/ELF/InputSection.cpp
908

Probably not a big deal, but curious to do this test when it might not be needed. Ah, partly because fo the if/else if/else structure - if this used early-return instead, perhaps it would be easier to write it like this:

if (sym.isTls() && !Out::tlsPhdr) {
  target->relocateNoSym(bufLoc, type, 0);
  continue;
}
if (isDebug && type == target->symbolicRel) {
  auto *ds = dyn_cast<Defined>(&sym);
  if (!sym.getOutputSection() ||
        (ds && ds->section->repl != ds->section)) {
    target->relocateNoSym(bufLoc, type, isDebugLocOrRanges ? UINT64_MAX - 1 : UINT64_MAX);
    continue;
  }
}
target->relocateNoSym(bufLoc, type,
                            SignExtend64<bits>(sym.getVA(addend)));
913–917

"normal behavior" might be overstating it - it was the behavior of lld and is the behavior of gold, but isn't the behavior of bfd.ld.

maybe "Other options such as zero (1 for debug_ranges) (bfd.ld) or zero+addend (gold) result in overlaps with valid low-address ranges, or deeper problems terminating debug_loc and debug_range lists early"?

920–922

I think you can omit the "DW_AT_low_pc (which may have a non-zero addend)" part of this, perhaps? In general we don't want -1+addend to wrap around into a valid address range, no matter which attribute it might be on.

929–930

Might clarify "a pair of 0 (0+addend could cause this when handling an empty function)" (though perhaps that deosn't need to be called out here - since we already motivated this earlier in the comment talking about low or zero addresses - so only the -1 case needs to be justified/explained why it can't use the same solution as all the other .debug_ sections being handled here)

Probably "the entries and -1" rather than a comma there - since there's only two things in this list. (& maybe the end of this "Use -2 instead" might not stand on its own as a sentence - "... so -2 is used."

lld/test/ELF/debug-dead-reloc.s
38

So even if the comdat group is identical, this will resolve only one of them & use -1 for the other?

I think that's a regression/separate change that probably should have a sepraate discussion compared to the -1 rather than 0+addend change.

Seems there could be some value in resolving all the function descriptions to the same code where possible - but I also see the problem with having CUs ranges overlap (which I think @probinson is especially concerned about). So the change may well be merited, but I think deserves a separate review/discussion.

(eg: currently... oh, bfd and gold currently do the "point all the DWARF to the one definition" but it seems lld doesn't do that and, as you say, already does zero (well, zero+addend) out the duplicate references to the same deduplicated comdat) Fair enough then... curious, but not a change in behavior, so guess we slipped that in somewhere along the way & hasn't broken anything that I've heard.

MaskRay updated this revision to Diff 271893.Thu, Jun 18, 6:55 PM
MaskRay marked 7 inline comments as done.

Address comments

lld/ELF/InputSection.cpp
908

Thanks for the suggestion! I thought about the additional test a normal code path would run. It seems that you're also concerned - so I'll refactor the if else if else structure as you suggested to avoid an unneeded dyn_cast.

913–917

I'll just say "Resolving to addend". I may send a patch to GNU ld, and a statement about GNU ld's behavior may be soon outdated.

lld/test/ELF/debug-dead-reloc.s
38

A relocation referencing a local symbol defined in a non-prevailing comdat group currently resolves to 0+addend, instead of prevailing_symbol_address+addend. We change it from 0+addend to -1. This is not a regression.

A relocation referencing a global symbol defined in a non-prevailing comdat group resolves prevailing_symbol_address+addend (ELF rule). This patch does not change the behavior. Though, I can't find any global symbol reference in .debug_*
Added a test.

dblaikie added inline comments.Thu, Jun 18, 10:10 PM
lld/ELF/InputSection.cpp
908

yeah, I doubt it's a huge actual perf concern - but shorter scoped variables, etc, make for easier to understand code

lld/test/ELF/debug-dead-reloc.s
38

Yep, just local symbols - you can test the behavior between bfd/gold/lld with this example:

a.cpp:

inline void f1() { }
void f2() { f1(); }

b.cpp:

inline void f1() { }
void f2();
int main() {
  f1();
  f2();
}
$ clang++ f1.cpp f2.cpp -g
$ llvm-dwarfdump a.out

& you'll see that gold and bfd resolve all the relocations to the same comdat to the same address - though this only happens if the comdats are identical. If they differ (say, if you make f1 do some arithmetic and optimize one but don't optimize the other) then the behavior will match lld - keeping one but pointing only the matching DWARF to that definition, and all the other DWARF will point to the tombstone value (whatever that is in each linker).

Anyway - nothing new here/related to this patch, and the lld behavior's probably preferable for a lot of folks - means CUs don't start overlapping unintentionally, etc.

MaskRay marked an inline comment as done.Thu, Jun 18, 11:25 PM
MaskRay added inline comments.
lld/test/ELF/debug-dead-reloc.s
38

This was previously discussed on https://bugs.llvm.org/show_bug.cgi?id=37212
I agree with you that it deserves its own email thread.

See PRETEND and _bfd_elf_check_kept_section in binutils-gdb/bfd/elflink.c for GNU ld's behavior. GNU ld finds a replacement if the sizes are equal.

Anyway - nothing new here/related to this patch, and the lld behavior's probably preferable for a lot of folks - means CUs don't start overlapping unintentionally, etc.

I agree. I think a tombstone value is better than PRETEND logic.

jhenderson accepted this revision.Fri, Jun 19, 12:34 AM

Latest changes still LGTM, FWIW.

@grimar @probinson Does the latest version look good to you?

grimar added inline comments.Mon, Jun 22, 5:01 AM
lld/ELF/InputSection.cpp
857

use const bool for these 2?

912

Missing an empty line before if.

935

It looks a bit unideal to me, because auto *ds = dyn_cast<Defined>(&sym) is always called,
though it is not always needed.
The same happens with isDebugLocOrRanges ? UINT64_MAX - 1 : UINT64_MAX condition, which
can be caluclated once.
Also isDebugSection() just do an excessive job it seems, because it checks the ".debug_" prefix.
And sym.getOutputSection() checks that sym is Defined internally, so this check is done twice too.

What about doing the following?

template <class ELFT, class RelTy>
void InputSection::relocateNonAlloc(uint8_t *buf, ArrayRef<RelTy> rels) {
  const unsigned bits = sizeof(typename ELFT::uint) * 8;
  const bool isDebug = isDebugSection(*this);
  const uint64_t Tombstone = (name == ".debug_loc" || name == ".debug_ranges")
                                 ? (UINT64_MAX - 1)
                                 : UINT64_MAX;
  ....

    if (sym.isTls() && !Out::tlsPhdr) {
      target->relocateNoSym(bufLoc, type, 0);
      continue;
    }

    if (!isDebug || type != target->symbolicRel) {
      target->relocateNoSym(bufLoc, type,
                            SignExtend64<bits>(sym.getVA(addend)));
      continue;
    }

    auto *ds = dyn_cast<Defined>(&sym);
    if (!ds || !ds->section->repl || !ds->section->repl->getOutputSection() ||
        ds->section->repl != ds->section)
      target->relocateNoSym(bufLoc, type, Tombstone);
    else
      target->relocateNoSym(bufLoc, type,
                            SignExtend64<bits>(sym.getVA(addend)));
  }
}
grimar added inline comments.Mon, Jun 22, 5:08 AM
lld/ELF/InputSection.cpp
935

Also isDebugSection() just do an excessive job it seems, because it checks the ".debug_" prefix.

By this I meant that probably no reason to have isDebug && part in the condition:

isDebug && (name == ".debug_loc" || name == ".debug_ranges")

Though it is not the piece to really worry about.

MaskRay marked 3 inline comments as done.Mon, Jun 22, 9:21 AM
MaskRay added inline comments.
lld/ELF/InputSection.cpp
935

isDebugLocOrRanges ? UINT64_MAX - 1 : UINT64_MAX can easily hoisted by the compiler if it likes. Manually doing this is unnecessary.

Though it is not the piece to really worry about.

Agree.

because auto *ds = dyn_cast<Defined>(&sym) is always called, though it is not always needed.

The suggested transformation looks a bit unnecessary to me. Most relocations from non-SHF_ALLOC sections are from .debug_*. The remaining ones are uncommon and/or tiny sections such as __patchable_function_entries, .stack_sizes, .llvm_addrsig, etc. There is very little we can save from the restricting the scope of dyn_cast.

I want to keep the control flows as is.

MaskRay updated this revision to Diff 272467.Mon, Jun 22, 9:22 AM

Add an empty line. Add const

I've been watching, and have nothing to add to what James or Dave had to say.

This revision was automatically updated to reflect the committed changes.

Whilst merging this change into our downstream port, our private tests flagged up a behaviour difference between our old behaviour and the version this patch implements. On further investigation, I think it is a bug in the upstream patch rather than our merge or tests.

Consider the ICF case with two duplicate functions:

int foo(int x)
{
	return 42 + x;
}

int bar(int x)
{
	return 42 + x;
}

int main(){
	int x = 0;
	x += foo(x);
	x += bar(x);
	return x;
}

In this case, LLD --icf=all will result in foo and bar being combined, with foo being the "kept" function. Prior to this change, references from debug data to the folded function (i.e. bar) were updated to point at the kept function. This would of course mean there was duplicate debug data for various addresses. In general, this may not have been ideal, but in one case I think it is actually necessary, specifically in the case of .debug_line. If you run gdb on the above example, having linked it using LLD, you can set a breakpoint at lines within bar (e.g. line 8 in my example), and it will be hit both when foo is executed and when bar is executed. This is reasonable behaviour, given the debug data doesn't include call site information the debugger can use here. With this patch, the reference in .debug_line to bar is patched to -1. This means that placing a breakpoint at line 8 will result in a breakpoint at address 0x6 (in my local case), rather than in the middle of foo. This in turn means the breakpoint will never be hit. From a user's point of view, if they don't know where their function is being folded into (which would probably be the norm), the breakpoint not working properly is a serious regression. Of course, if it is placed, the program will jump to a different function than expected (and may even break unexpectedly), but I think personally that this is less confusing, especially as that is the old behaviour anyway.

I think we need to add a special case for patching .debug_line, so that duplicates are not patched to -1, but rather to the symbol they are a duplicate of (dead references should still be patched to -1). This would match our downstream's old behaviour.

Whilst merging this change into our downstream port, our private tests flagged up a behaviour difference between our old behaviour and the version this patch implements. On further investigation, I think it is a bug in the upstream patch rather than our merge or tests.

Consider the ICF case with two duplicate functions:

int foo(int x)
{
	return 42 + x;
}

int bar(int x)
{
	return 42 + x;
}

int main(){
	int x = 0;
	x += foo(x);
	x += bar(x);
	return x;
}

In this case, LLD --icf=all will result in foo and bar being combined, with foo being the "kept" function. Prior to this change, references from debug data to the folded function (i.e. bar) were updated to point at the kept function. This would of course mean there was duplicate debug data for various addresses. In general, this may not have been ideal, but in one case I think it is actually necessary, specifically in the case of .debug_line. If you run gdb on the above example, having linked it using LLD, you can set a breakpoint at lines within bar (e.g. line 8 in my example), and it will be hit both when foo is executed and when bar is executed. This is reasonable behaviour, given the debug data doesn't include call site information the debugger can use here. With this patch, the reference in .debug_line to bar is patched to -1. This means that placing a breakpoint at line 8 will result in a breakpoint at address 0x6 (in my local case), rather than in the middle of foo. This in turn means the breakpoint will never be hit. From a user's point of view, if they don't know where their function is being folded into (which would probably be the norm), the breakpoint not working properly is a serious regression. Of course, if it is placed, the program will jump to a different function than expected (and may even break unexpectedly), but I think personally that this is less confusing, especially as that is the old behaviour anyway.

I think we need to add a special case for patching .debug_line, so that duplicates are not patched to -1, but rather to the symbol they are a duplicate of (dead references should still be patched to -1). This would match our downstream's old behaviour.

I have mixed feelings about this - I assume any strategy for this would likely also trigger for the case I've seen more often, which is two inline function definitions that are identical and rather than one DW_TAG_subprogram getting to point to the final copy, both DW_TAG_subprograms in different CUs point to the final copy (if that copy is the same length, I think is the test in other linkers - but maybe it's if their contents are identical)?

Is this a change in lld's behavior? Or has lld always resolved identical-but-deduplicated functions to only the one subprogram, or to all subprograms?

Or is your workaround (the one you have in your fork) /only/ for the line table, but not for DW_AT_ranges on the CU or for the low_pc of the subprogram? That would seem pretty quirky - having addresses covered by the line table that aren't covered by the ranges of the CU? But also if it is applied to teh CU too, then you have CUs with overlapping ranges (@probinson I think expressed some concern about overlapping CUs being undesirable previously)

I have mixed feelings about this - I assume any strategy for this would likely also trigger for the case I've seen more often, which is two inline function definitions that are identical and rather than one DW_TAG_subprogram getting to point to the final copy, both DW_TAG_subprograms in different CUs point to the final copy (if that copy is the same length, I think is the test in other linkers - but maybe it's if their contents are identical)?

Is this a change in lld's behavior? Or has lld always resolved identical-but-deduplicated functions to only the one subprogram, or to all subprograms?

I haven't specifically looked at the other debug sections. I believe they previously contained duplicate addresses, but am not 100% sure. In most cases this is bad due to ambiguity reasons, I believe. In our downstream port, we patched to -1/-2 (except for .debug_line).

Or is your workaround (the one you have in your fork) /only/ for the line table, but not for DW_AT_ranges on the CU or for the low_pc of the subprogram? That would seem pretty quirky - having addresses covered by the line table that aren't covered by the ranges of the CU? But also if it is applied to teh CU too, then you have CUs with overlapping ranges (@probinson I think expressed some concern about overlapping CUs being undesirable previously)

Our implementation specifically special-cases behaviour for .debug_line to be different to other debug sections, which used -1/-2 for the address references to de-duplicated code, so .debug_ranges for example didn't have overlapping ranges etc. I don't think it's that quirky - the debug line table is actually independent of other debug data and can usefully exist without .debug_info etc; it doesn't know anything about its corresponding CU (apart from some file paths). That all being said, it's been a fairly recent change in our downstream port. Before that, we just did what this patch does in .debug_line as well as other sections.

Another of our downstream tests has flagged up another issue with this change, this time with TLS symbols. clang appears to use R_X86_64_DTPOFF64 relocations for such symbols, rather than R_X86_64_64 relocations. This means the condition type == target->symbolicRel is not triggering for them. In our downstream port, we didn't check the relocation type, so the references to the discarded TLS variables was always patched to -1. Meanwhile, old LLD before this change appears to be patching it to 0. New LLD is patching it to something relative to the start of the TLS block (in my case it's 0xffffffffffff8000, but I imagine it depends on layout/base address etc). I think we need to relax the relocation type check, possibly even remove it completely.

Whilst trying to create a small fix locally, I noticed that our downstream test was originally motivated by a R_X86_64_DTPOFF32 relocation rather than a 64-bit one, since the DWARF was using a 4-byte location offset in this case. I guess we need to allow both these relocation types to follow the same code path as R_X86_64_64. At that point, I start to wonder whether we should just allow all relocation types to follow that path...

Whilst trying to create a small fix locally, I noticed that our downstream test was originally motivated by a R_X86_64_DTPOFF32 relocation rather than a 64-bit one, since the DWARF was using a 4-byte location offset in this case. I guess we need to allow both these relocation types to follow the same code path as R_X86_64_64. At that point, I start to wonder whether we should just allow all relocation types to follow that path...

I agree that R_X86_64_DTPOFF{32,64} (R_DTPREL) should use -1 as well, but I don't think R_X86_64_32 should use it (tested in debug-dead-reloc-32.s)

Created D82899.