This is an archive of the discontinued LLVM Phabricator instance.

[DWARF] Emit DW_AT_call_pc for tail calls
ClosedPublic

Authored by vsk on Mar 17 2020, 6:05 PM.

Details

Summary

Record the address of a tail-calling branch instruction within its call
site entry using DW_AT_call_pc. This allows a debugger to determine the
address to use when creating artificial frames.

This creates an extra attribute + relocation at tail call sites, which
constitute roughly 3% and 5% of all call sites in xnu and clang, respectively.
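
For illustration, a minimal C sketch of the kind of call this targets (function names invented): at -O2 the call below typically lowers to a bare jmp, so no return address ever points back into caller, and DW_AT_call_pc records the branch instruction itself.

int callee(int x);

int caller(int x) {
  return callee(x + 1); // becomes `jmp callee`; the call site entry
                        // records this branch's address in DW_AT_call_pc
}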

rdar://60307600

Diff Detail

Event Timeline

vsk created this revision.Mar 17 2020, 6:05 PM

(general caveat: this should be committed as separate patches - between llvm and llvm-dsymutil (much like llvm-dwarfdump and llvm codegen are usually committed separately) - but yeah, it's nice to see it all together for review)

Do you happen to have numbers for, say, clang, on the growth of object size because of this - specifically I'd be interested in the growth of the .rela.debug_addr section in a DWARFv5 build.

Did I understand correctly from previous discussions that the goal was to have call_sites for /every/ call, at some point? (I'm concerned that'll be a /lot/ of addresses)

I suppose another question, I guess - how many more call_sites is this than currently/without this patch?

djtodoro added inline comments.
llvm/lib/CodeGen/AsmPrinter/DwarfDebug.cpp
885–886

Looks like the dwarf::DW_AT_call_return_pc is enough for GDB to distinguish that even for tail calls. Can we do the same for LLDB and avoid the call_pc?

llvm/test/DebugInfo/MIR/X86/call-site-gnu-vs-dwarf5-attrs.mir
29

GCC produces this:

<2><9a>: Abbrev Number: 9 (DW_TAG_GNU_call_site)
   <9b>   DW_AT_low_pc      : 0x23
   <a3>   DW_AT_GNU_tail_call: 1
   <a3>   DW_AT_abstract_origin: <0xd4>
djtodoro added inline comments.Mar 18 2020, 1:40 AM
llvm/lib/CodeGen/AsmPrinter/DwarfDebug.cpp
1801

Should we use the isCandidateForCallSiteEntry() instead?

The dsymutil part LGTM.

vsk marked 2 inline comments as done.Mar 18 2020, 3:28 PM

(general caveat: this should be committed as separate patches - between llvm and llvm-dsymutil (much like llvm-dwarfdump and llvm codegen are usually committed separately) - but yeah, it's nice to see it all together for review)

Sounds good, I'll split this up before committing.

Do you happen to have numbers for, say, clang, on the growth of object size because of this - specifically I'd be interested in the growth of the .rela.debug_addr section in a DWARFv5 build.

I looked at the size impact on a Darwin -gdwarf-4 stage2 clang build. The aggregate size of .o's post-patch grows 0.006% (8367046264 bytes -> 8367544576 bytes, a 486KB increase). Tail calls make up ~5% of all calls (in clang at least), so this is in line with what was measured in D72489. In D72489, I added a relocation to every non-tail call site entry, which increased the aggregate .o size by 0.04% (~3MB).

I don't think linker support for dwarf5 is mature enough on Darwin for me to measure. Perhaps that measurement wouldn't be very useful, as ELF relocations are different.

Did I understand correctly from previous discussions that the goal was to have call_sites for /every/ call, at some point? (I'm concerned that'll be a /lot/ of addresses)

Yes. When tuning for lldb, we've been emitting call sites for every call for some time now.

Hm :(. Digging through the history, I see that I accidentally turned on DIFlagAllCallsDescribed when targeting -gdwarf-4 + GDB in D69743 (November 2019). This was done prematurely, I apologize for this. It should not have happened until the entry values feature got enabled by default. I've checked in test coverage for -gdwarf-4 + -ggdb to clang/test/CodeGenCXX/dbg-info-all-calls-described.cpp in 47622efc.

@dblaikie @djtodoro I'd prefer to leave things as they are, but am also open to disabling DIFlagAllCallsDescribed for -gdwarf-4 + -ggdb until entry values get re-enabled by default, let me know what you think.

I suppose another question, I guess - how many more call_sites is this than currently/without this patch?

This patch doesn't change the number of call site entries (DW_TAG_call_site), it just adds relocations to ~3-5% of those entries.

llvm/lib/CodeGen/AsmPrinter/DwarfDebug.cpp
885–886

I don't think that would be valid DWARF. The return pc for a tail call isn't given by the address of the instruction after the tail-calling branch.

Are you sure GDB uses DW_AT_call_return_pc to figure out the address where the tail call was made?

1801

Yes, thanks!

llvm/test/DebugInfo/MIR/X86/call-site-gnu-vs-dwarf5-attrs.mir
29

Is the DW_AT_low_pc the address of the tail-calling branch instruction, or is it the address of the next instruction?

vsk updated this revision to Diff 251198.Mar 18 2020, 3:38 PM
vsk marked an inline comment as done.

Incorporate feedback from @djtodoro.

In D76336#1929994, @vsk wrote:

(general caveat: this should be committed as separate patches - between llvm and llvm-dsymutil (much like llvm-dwarfdump and llvm codegen are usually committed separately) - but yeah, it's nice to see it all together for review)

Sounds good, I'll split this up before committing.

Do you happen to have numbers for, say, clang, on the growth of object size because of this - specifically I'd be interested in the growth of the .rela.debug_addr section in a DWARFv5 build.

I looked at the size impact on a Darwin -gdwarf-4 stage2 clang build. The aggregate size of .o's post-patch grows 0.006% (8367046264 bytes -> 8367544576 bytes, a 486KB increase). Tail calls make up ~5% of all calls (in clang at least), so this is in line with what was measured in D72489. In D72489, I added a relocation to every non-tail call site entry, which increased the aggregate .o size by 0.04% (~3MB).

I don't think linker support for dwarf5 is mature enough on Darwin for me to measure. Perhaps that measurement wouldn't be very useful, as ELF relocations are different.

Right, an ELF measurement is what I was interested in, owing to the different relocations & .rela.debug_addr is especially significant to my use case involving split DWARF and compressed DWARF, so some of the largest remaining content is debug_addr relocations.

Did I understand correctly from previous discussions that the goal was to have call_sites for /every/ call, at some point? (I'm concerned that'll be a /lot/ of addresses)

Yes. When tuning for lldb, we've been emitting call sites for every call for some time now.

But they haven't had addresses on them?

Hmm, that doesn't seem to reproduce with a simple example:

void f1();
void f2() {
  f1();
}

Doesn't have any call_site (compiled with clang with -g -c for linux/x86 or darwin/x86).

Hm :(. Digging through the history, I see that I accidentally turned on DIFlagAllCallsDescribed when targeting -gdwarf-4 + GDB in D69743 (November 2019). This was done prematurely, I apologize for this. It should not have happened until the entry values feature got enabled by default. I've checked in test coverage for -gdwarf-4 + -ggdb to clang/test/CodeGenCXX/dbg-info-all-calls-described.cpp in 47622efc.

@dblaikie @djtodoro I'd prefer to leave things as they are, but am also open to disabling DIFlagAllCallsDescribed for -gdwarf-4 + -ggdb until entry values get re-enabled by default, let me know what you think.

Sure, happy to leave things as-is.

vsk added a comment.Mar 18 2020, 4:27 PM
In D76336#1929994, @vsk wrote:

(general caveat: this should be committed as separate patches - between llvm and llvm-dsymutil (much like llvm-dwarfdump and llvm codegen are usually committed separately) - but yeah, it's nice to see it all together for review)

Sounds good, I'll split this up before committing.

Do you happen to have numbers for, say, clang, on the growth of object size because of this - specifically I'd be interested in the growth of the .rela.debug_addr section in a DWARFv5 build.

I looked at the size impact on a Darwin -gdwarf-4 stage2 clang build. The aggregate size of .o's post-patch grows 0.006% (8367046264 bytes -> 8367544576 bytes, a 486KB increase). Tail calls make up ~5% of all calls (in clang at least), so this is in line with what was measured in D72489. In D72489, I added a relocation to every non-tail call site entry, which increased the aggregate .o size by 0.04% (~3MB).

I don't think linker support for dwarf5 is mature enough on Darwin for me to measure. Perhaps that measurement wouldn't be very useful, as ELF relocations are different.

Right, an ELF measurement is what I was interested in, owing to the different relocations & .rela.debug_addr is especially significant to my use case involving split DWARF and compressed DWARF, so some of the largest remaining content is debug_addr relocations.

I'll find a linux system and share some numbers tomorrow.

Did I understand correctly from previous discussions that the goal was to have call_sites for /every/ call, at some point? (I'm concerned that'll be a /lot/ of addresses)

Yes. When tuning for lldb, we've been emitting call sites for every call for some time now.

But they haven't had addresses on them?

They have had addresses: a DW_TAG_call_site may contain DW_AT_call_return_pc, which is an address as of D72489.

Hmm, that doesn't seem to reproduce with a simple example:

void f1();
void f2() {
  f1();
}

Doesn't have any call_site (compiled with clang with -g -c for linux/x86 or darwin/x86).

Have you tried adding '-O1 -disable-llvm-passes'? Call site entries aren't emitted unless optimization is enabled.
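
For example (a sketch; -disable-llvm-passes is a cc1 option, so it may need -Xclang when going through the driver, and the tag may print as DW_TAG_GNU_call_site depending on tuning):

clang -O1 -Xclang -disable-llvm-passes -g -c f2.c -o f2.o
llvm-dwarfdump f2.o | grep -A2 call_site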

Hm :(. Digging through the history, I see that I accidentally turned on DIFlagAllCallsDescribed when targeting -gdwarf-4 + GDB in D69743 (November 2019). This was done prematurely, I apologize for this. It should not have happened until the entry values feature got enabled by default. I've checked in test coverage for -gdwarf-4 + -ggdb to clang/test/CodeGenCXX/dbg-info-all-calls-described.cpp in 47622efc.

@dblaikie @djtodoro I'd prefer to leave things as they are, but am also open to disabling DIFlagAllCallsDescribed for -gdwarf-4 + -ggdb until entry values get re-enabled by default, let me know what you think.

Sure, happy to leave things as-is.

In D76336#1930089, @vsk wrote:
In D76336#1929994, @vsk wrote:

(general caveat: this should be committed as separate patches - between llvm and llvm-dsymutil (much like llvm-dwarfdump and llvm codegen are usually committed separately) - but yeah, it's nice to see it all together for review)

Sounds good, I'll split this up before committing.

Do you happen to have numbers for, say, clang, on the growth of object size because of this - specifically I'd be interested in the growth of the .rela.debug_addr section in a DWARFv5 build.

I looked at the size impact on a Darwin -gdwarf-4 stage2 clang build. The aggregate size of .o's post-patch grows 0.006% (8367046264 bytes -> 8367544576 bytes, a 486KB increase). Tail calls make up ~5% of all calls (in clang at least), so this is in line with what was measured in D72489. In D72489, I added a relocation to every non-tail call site entry, which increased the aggregate .o size by 0.04% (~3MB).

I don't think linker support for dwarf5 is mature enough on Darwin for me to measure. Perhaps that measurement wouldn't be very useful, as ELF relocations are different.

Right, an ELF measurement is what I was interested in, owing to the different relocations & .rela.debug_addr is especially significant to my use case involving split DWARF and compressed DWARF, so some of the largest remaining content is debug_addr relocations.

I'll find a linux system and share some numbers tomorrow.

Thanks!

Did I understand correctly from previous discussions that the goal was to have call_sites for /every/ call, at some point? (I'm concerned that'll be a /lot/ of addresses)

Yes. When tuning for lldb, we've been emitting call sites for every call for some time now.

But they haven't had addresses on them?

They have had addresses: a DW_TAG_call_site may contain DW_AT_call_return_pc, which is an address as of D72489.

"may"? or does in all cases? So this review is about changing some of those (specifically the ones on tail calls) call_return_pc to call_pc?

Hmm, that doesn't seem to reproduce with a simple example:

void f1();
void f2() {
  f1();
}

Doesn't have any call_site (compiled with clang with -g -c for linux/x86 or darwin/x86).

Have you tried adding '-O1 -disable-llvm-passes'? Call site entries aren't emitted unless optimization is enabled.

Ah, right right - thanks for the reminder. (why is this only relevant in optimized builds?) (sorry I keep repeating things with this feature - seems I still don't have it all clearly in my head :/)

djtodoro added a comment. Edited Mar 19 2020, 2:34 AM

@vsk Thanks for working on this!

@vsk wrote:
@dblaikie @djtodoro I'd prefer to leave things as they are, but am also open to disabling DIFlagAllCallsDescribed for -gdwarf-4 + -ggdb until entry values get re-enabled by default, let me know what you think.

I'd keep it as is. I am preparing the patch to enable entry values by default, since that patch was not the cause of the reported issue.

@dblaikie wrote:
why is this only relevant in optimized builds?

I think the main benefit of the call-site information is using it together with call-site parameters, which are used for computing the actual value of a parameter even when the parameter's location was not provided (optimized out). That improves the user experience when debugging optimized code. In addition, in the case of tail calls, the call_site debug info is used for printing artificial call frames for the tail calls (and tail calls are typical of optimized code?).
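
A contrived C illustration of that first point (names invented; compile with -O2 -g): once param is dead inside callee, the register holding it may be reused and its location reported as optimized out, but a call-site parameter on the call in caller lets the debugger recompute the value via entry values.

__attribute__((noinline)) int callee(int param) {
  int r = param + 1;
  // Past this point `param` may have no location; with call-site
  // parameter info a debugger can still recover it from the caller.
  return r * r;
}

int caller(void) { return callee(41); }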

llvm/lib/CodeGen/AsmPrinter/DwarfDebug.cpp
885–886

GCC generates DW_AT_low_pc for DWARF 4 (actually the GNU extension) and DW_AT_call_return_pc for DWARF 5 on the call_site TAGs describing tail calls.
That is the address after the jump-like instruction for a tail call.

GDB understands that info and prints the artificial frames (the test case from the D76337):
gcc -g -O2 tail.c -o gcc-tail-example

...
(gdb) r
Starting program: gcc-tail-example
Breakpoint 1, sink () at tail.c:4
4         x++;
(gdb) bt
#0  sink () at tail.c:4
#1  0x00000000004004a6 in func3 () at tail.c:9
#2  0x00000000004004b7 in func2 () at tail.c:13
#3  0x00000000004004d6 in func1 () at tail.c:18
#4  0x0000000000400367 in main () at tail.c:22
...

GDB version 8.3.5
GCC version 4.9.3

The same happens when using the latest LLVM trunk (with D73534 applied), since with DWARF 4 + GDB tuning clang generates the same DWARF as GCC for this example.
clang -g -O2 tail.c -o tail-example

...
(gdb) r
Starting program: tail-example
Breakpoint 1, sink () at tail.c:4
4         x++;
(gdb) bt
#0  sink () at tail.c:4
#1  0x0000000000401149 in func3 () at tail.c:9      // the DW_AT_low_pc  is 0x401149 from the corresponding call_site
#2  0x0000000000401156 in func2 () at tail.c:13
#3  0x0000000000401169 in func1 () at tail.c:18
#4  0x0000000000401176 in main () at tail.c:22
...
llvm/test/DebugInfo/MIR/X86/call-site-gnu-vs-dwarf5-attrs.mir
29

As I mentioned above, it is the address after the tail-call instruction.

vsk marked an inline comment as done.Mar 19 2020, 3:19 PM

This patch causes a 0.7% size increase in .rela.debug_addr in a stage2 -gdwarf-5 build on Linux. That's ~569KB, similar to the size increase seen on Darwin (see my earlier comment). Here are the steps I took to measure:

I rented out a VPS for the day to gather the data -- if at all possible please lmk by EOD today if you're interested in a different experiment.

@dblaikie wrote:
why is this only relevant in optimized builds?

What @djtodoro said :). The features enabled by DW_TAG_call_site entries aren't useful/applicable at -O0. When call site entries /are/ useful, they always have to contain an address, otherwise the debugger can't figure out which call site entry to pick when stopped at a particular PC value.

Prior to this patch, we only inserted an address (DW_AT_call_return_pc) into ~95% of call site entries -- the non-tail-calling ones. We need the address of the tail-calling instruction, though, to make backtraces with artificial frames work properly.
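
As a self-contained sketch of why every useful entry carries an address (invented structures; real consumers read these attributes out of the DIE tree), the debugger matches the PC it is stopped at against the recorded addresses:

#include <cstdint>

struct CallSite {
  const CallSite *next;
  bool hasCallPC, hasReturnPC;
  uint64_t callPC, returnPC;
};

// For a regular frame the PC of interest is the saved return address
// (DW_AT_call_return_pc); for an artificial tail-call frame it is the
// branch itself (DW_AT_call_pc).
const CallSite *findCallSite(const CallSite *list, uint64_t pc) {
  for (const CallSite *cs = list; cs; cs = cs->next) {
    if (cs->hasReturnPC && cs->returnPC == pc)
      return cs;
    if (cs->hasCallPC && cs->callPC == pc)
      return cs;
  }
  return nullptr;
}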

llvm/lib/CodeGen/AsmPrinter/DwarfDebug.cpp
885–886

Thanks for digging into this! This seems like a really weird quirk of gcc/gdb. To make this work, the debugger has to know that the DW_AT_call_return_pc for tail calls is bogus, and then find and subtract the correct offset from that address to find the true call_pc. It seems simpler to just emit the correct call_pc to begin with.

In this patch, for the case where we're tuning for gdb in dwarf4 mode, I've omitted DW_AT_call_pc (we'll keep emitting the strange DW_AT_call_return_pc -- I believe there's a test for this).

@dblaikie wrote:
why is this only relevant in optimized builds?

I think the main benefit of the call-site information is using it together with call-site parameters, which are used for computing the actual value of a parameter even when the parameter's location was not provided (optimized out).

That presents some interestingly tricky challenges. We have -fstandalone-debug for if you are building one part of a program with debug info and others without - but there's nothing equivalent for if you're building part of a program optimized but other parts unoptimized. And this heuristic (only emit these attributes for optimized code) assumes the caller and callee are both equally optimized/similarly compiled - which isn't necessarily true.

That improves the user experience when debugging optimized code. In addition, in the case of tail calls, the call_site debug info is used for printing artificial call frames for the tail calls (and tail calls are typical of optimized code?).

The tail call case is easier - since that's the caller-side. It'd probably be better to just emit that on any tail call, optimized or unoptimized code - I guess I mean, ideally the choice wouldn't be made at the frontend, but at the backend if the call ends up being a tail call.

(I'm thinking of LTO situations, attribute optnone, other things like that - the frontend doesn't really know if something is optimized or not & really you can't tell if a callee is optimized because it's in another translation unit)

But such is life/some of that has no obvious solution. Wonder what GCC does.

vsk added a comment.Mar 19 2020, 3:27 PM

@dblaikie wrote:
why is this only relevant in optimized builds?

I think the main benefit of the call-site information is using it together with call-site parameters, which are used for computing the actual value of a parameter even when the parameter's location was not provided (optimized out).

That presents some interestingly tricky challenges. We have -fstandalone-debug for if you are building one part of a program with debug info and others without - but there's nothing equivalent for if you're building part of a program optimized but other parts unoptimized. And this heuristic (only emit these attributes for optimized code) assumes the caller and callee are both equally optimized/similarly compiled - which isn't necessarily true.

That's ok though, because a debugger can handle call site entries being only partially available. I.e. there's no reason (afaik) for mixing and matching optimized/unoptimized .o's to regress debugger features enabled by call site entries.

That improves the user experience when debugging optimized code. In addition, in the case of tail calls, the call_site debug info is used for printing artificial call frames for the tail calls (and tail calls are typical of optimized code?).

The tail call case is easier - since that's the caller-side. It'd probably be better to just emit that on any tail call, optimized or unoptimized code - I guess I mean, ideally the choice wouldn't be made at the frontend, but at the backend if the call ends up being a tail call.

We only want to emit call site entries at tail-calling sites when the caller has debug info, though. We do that today by relying on the DIFlagAllCallsDescribed attribute, which the frontend provides. Also fwiw the artificial frames debugger feature doesn't work if only the tail-calling sites are described -- all of the calls have to be described for the debugger to reconstruct feasible paths through the call graph.

(I'm thinking of LTO situations, attribute optnone, other things like that - the frontend doesn't really know if something is optimized or not & really you can't tell if a callee is optimized because it's in another translation unit)

Hm, oh, good point. But, 99+% of the time, isn't the workflow to compile with -flto=thin + -O{1,2,3,s,z}? We handle that fine. But I guess if you're doing -O0 -disable-O0-optnone -flto, you wouldn't get call site entries. Hrm. Does that come up much? I guess we could fix that by adding a frontend flag?

In D76336#1932330, @vsk wrote:

@dblaikie wrote:
why is this only relevant in optimized builds?

I think the main benefit of the call-site information is using it together with call-site parameters, which are used for computing the actual value of a parameter even when the parameter's location was not provided (optimized out).

That presents some interestingly tricky challenges. We have -fstandalone-debug for if you are building one part of a program with debug info and others without - but there's nothing equivalent for if you're building part of a program optimized but other parts unoptimized. And this heuristic (only emit these attributes for optimized code) assumes the caller and callee are both equally optimized/similarly compiled - which isn't necessarily true.

That's ok though, because a debugger can handle call site entries being only partially available. I.e. there's no reason (afaik) for mixing and matching optimized/unoptimized .o's to regress debugger features enabled by call site entries.

Ah, I wasn't suggesting it'd regress functionality - but that the heuristic of "only emit these for optimized code" was just that, a heuristic with some false negatives (or positives, or whichever way you think of it). Partly then asking the question: is there a more accurate way we could determine when to emit these attributes, rather than using a frontend is/isn't optimized heuristic.

That improves the user experience when debugging optimized code. In addition, in the case of tail calls, the call_site debug info is used for printing artificial call frames for the tail calls (and tail calls are typical of optimized code?).

The tail call case is easier - since that's the caller-side. It'd probably be better to just emit that on any tail call, optimized or unoptimized code - I guess I mean, ideally the choice wouldn't be made at the frontend, but at the backend if the call ends up being a tail call.

We only want to emit call site entries at tail-calling sites when the caller has debug info, though. We do that today by relying on the DIFlagAllCallsDescribed attribute, which the frontend provides.

"when the caller has debug info" - I'm probably misunderstanding what you mean there. Of course the caller has to have debug info (a DW_TAG_subprogram for the calling function) to describe the call site, since the call site tag goes inside the caller's subprogram tag.

What I meant, not in the tail-calling case, but in other call sites, the callee might be optimized but the caller might be unoptimized, so relying on "is the caller optimized" to decide whether to describe the call site misses some cases (caller is unoptimized, callee is optimized - it'd improve the user experience if that call (to the optimized function) had a call_site with call site parameters to help debuggability of the optimized function, if I understand correctly)

Also fwiw the artificial frames debugger feature doesn't work if only the tail-calling sites are described -- all of the calls have to be described for the debugger to reconstruct feasible paths through the call graph.

(I'm thinking of LTO situations, attribute optnone, other things like that - the frontend doesn't really know if something is optimized or not & really you can't tell if a callee is optimized because it's in another translation unit)

Hm, oh, good point. But, 99+% of the time, isn't the workflow to compile with -flto=thin + -O{1,2,3,s,z}? We handle that fine. But I guess if you're doing -O0 -disable-O0-optnone -flto, you wouldn't get call site entries. Hrm. Does that come up much? I guess we could fix that by adding a frontend flag?

I was just thinking straight -O0, or attribute((optnone)) - that could produce the "caller is unoptimized but callee is optimized, so the absence of a call_site TAG is a failure of the call_site TAG heuristic (a false negative)". Not the end of the world, and I'm not sure there's a better solution than the heuristic you've got, but just articulating a problem there. How much does that come up? Well, as much as optnone/O0-preserved-through-LTO comes up, which was originally implemented for Sony - apparently their users use this as a debugging technique to maintain performance of the rest of the program while making parts of it more debuggable by using -O0.

@dblaikie wrote:

But such is life/some of that has no obvious solution. Wonder what GCC does.

Oh yes, I agree :) GCC does the same. It generates the info only in the case of "optimized" code...

llvm/lib/CodeGen/AsmPrinter/DwarfDebug.cpp
885–886

DWARF5 + GDB tuning will have both call_pc and return_pc?

I think we should phrase the check here like this:
(IsTail && !useGNUAnalogForDwarf5Feature(DD))...
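
A standalone model of the decision being discussed (a paraphrase only -- the names mirror the review comments, not the actual DwarfDebug.cpp code, and the GNU-analog predicate is assumed to mean "tuning for GDB with DWARF < 5"):

#include <cstdio>

enum class Tuning { GDB, LLDB };

struct Ctx {
  Tuning tuning;
  unsigned dwarfVersion;
  bool useGNUAnalogForDwarf5Feature() const {
    return tuning == Tuning::GDB && dwarfVersion < 5;
  }
};

// DW_AT_call_pc is attached only for tail calls, and only when not
// falling back to the GNU attribute analogs.
bool shouldEmitCallPC(const Ctx &ctx, bool isTail) {
  return isTail && !ctx.useGNUAnalogForDwarf5Feature();
}

int main() {
  std::printf("gdb+dwarf4:  %d\n", shouldEmitCallPC({Tuning::GDB, 4}, true));  // 0
  std::printf("gdb+dwarf5:  %d\n", shouldEmitCallPC({Tuning::GDB, 5}, true));  // 1
  std::printf("lldb+dwarf4: %d\n", shouldEmitCallPC({Tuning::LLDB, 4}, true)); // 1
}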

vsk updated this revision to Diff 251768.Mar 20 2020, 2:38 PM

Per @djtodoro's suggestion, avoid emitting DW_AT_call_pc when tuning for gdb. We already emit AT_call_return_pc in this case, which seems weird/wrong to me, but that's what gdb expects to see, it seems.

There's a side discussion here about how to determine when to emit call site entries. I think that ought to be considered separately. It looks like there are compelling use cases for mixing -O0 & optimized code -- if there's enough interest in this, and the size overhead looks reasonable, we might consider turning on call site entries at -O0 (possibly gated behind a flag).

In D76336#1934599, @vsk wrote:

Per @djtodoro's suggestion, avoid emitting DW_AT_call_pc when tuning for gdb. We already emit AT_call_return_pc in this case, which seems weird/wrong to me, but that's what gdb expects to see, it seems.

Does it work for GDB? If so, how? Please be sure there's clear documentation (probably in a somewhat long explanatory comment in the LLVM source itself) about why this divergence is justified - does GDB need this for providing some functionality, but less than is possible with DW_AT_call_pc, so LLDB wants that so it can provide the better experience, but providing only that would mean GDB would provide a worse experience than if it has return_pc?

vsk added a comment.Mar 20 2020, 3:40 PM
In D76336#1934599, @vsk wrote:

Per @djtodoro's suggestion, avoid emitting DW_AT_call_pc when tuning for gdb. We already emit AT_call_return_pc in this case, which seems weird/wrong to me, but that's what gdb expects to see, it seems.

Does it work for GDB?

Yes, @djtodoro's second-to-last comment shows an example where gdb works out the address for an artificial frame using the "return pc".

If so, how? Please be sure there's clear documentation (probably in a somewhat long explanatory comment in the LLVM source itself) about why this divergence is justified

I left a comment where DW_AT_call_pc is emitted explaining the situation: 'GDB prefers to work out what the call pc is by subtracting an offset from DW_AT_low_pc/DW_AT_call_return_pc'. I'll expand on this.

does GDB need this for providing some functionality, but less than is possible with DW_AT_call_pc, so LLDB wants that so it can provide the better experience, but providing only that would mean GDB would provide a worse experience than if it has return_pc?

gdb works backwards from non-standard usage of DW_AT_low_pc/DW_AT_call_return_pc at tail-call site entries to figure out the PC of tail-calling branch instructions. This means it doesn't need the compiler to emit DW_AT_call_pc, so we don't emit it (this change happened in my last patch update -- incidentally this means this patch no longer has any effect on debug info emission when tuning for gdb).

However, there isn't a good reason to tie non-gdb debuggers to this non-standardness, as it adds unnecessary complexity to the debugger. It forces the debugger to disassemble around the fake "return pc" to find the actual tail-calling branch. In the case where the tail-calling branch is the last instruction in the function, I'm not sure how that would work, as the fake "return pc" wouldn't necessarily point anywhere meaningful. To side-step that, we emit DW_AT_call_pc for non-gdb debuggers.

In D76336#1934726, @vsk wrote:
In D76336#1934599, @vsk wrote:

Per @djtodoro's suggestion, avoid emitting DW_AT_call_pc when tuning for gdb. We already emit AT_call_return_pc in this case, which seems weird/wrong to me, but that's what gdb expects to see, it seems.

Does it work for GDB?

Yes, @djtodoro's second-to-last comment shows an example where gdb works out the address for an artificial frame using the "return pc".

If so, how? Please be sure there's clear documentation (probably in a somewhat long explanatory comment in the LLVM source itself) about why this divergence is justified

I left a comment where DW_AT_call_pc is emitted explaining the situation: 'GDB prefers to work out what the call pc is by subtracting an offset from DW_AT_low_pc/DW_AT_call_return_pc'. I'll expand on this.

Should we do that only in DWARFv4, and leave DWARFv5 standard & let GDB implement the missing functionality to understand the standard form?

does GDB need this for providing some functionality, but less than is possible with DW_AT_call_pc, so LLDB wants that so it can provide the better experience, but providing only that would mean GDB would provide a worse experience than if it has return_pc?

gdb works backwards from non-standard usage of DW_AT_low_pc/DW_AT_call_return_pc at tail-call site entries to figure out the PC of tail-calling branch instructions. This means it doesn't need the compiler to emit DW_AT_call_pc, so we don't emit it (this change happened in my last patch update -- incidentally this means this patch no longer has any effect on debug info emission when tuning for gdb).

The language in the DWARF spec is, as always, pretty vague/general. But, yes, the call_pc wording in the spec (which includes a mention of jumps/tail calls) seems to be in contrast to the call_return_pc that only mentions calls, so I get where you're coming from.

However, there isn't a good reason to tie non-gdb debuggers to this non-standardness, as it adds unnecessary complexity to the debugger. It forces the debugger to disassemble around the fake "return pc" to find the actual tail-calling branch. In the case where the tail-calling branch is the last instruction in the function, I'm not sure how that would work, as the fake "return pc" wouldn't necessarily point anywhere meaningful. To side-step that, we emit DW_AT_call_pc for non-gdb debuggers.

the "I'm not sure how that would work" bit is sort of concerning to me - it does work for GDB, right? (I'd guess it points to the end of the instruction (same as the "high_pc" of a function points to the end of (or, one passed the end in both cases) of the function))

I'm not going to/don't mean to hold any of this work up, just curious.

I guess if GDB does add support for call_pc, it'll likely keep its extension support of call_return_pc as well so it'll keep working with the non-standard emission for GDB you're proposing, and then we can fix-forward & remove this workaround. Has someone filed the bug on GDB for this missing functionality?

vsk added a comment.Mar 20 2020, 4:58 PM
In D76336#1934726, @vsk wrote:
In D76336#1934599, @vsk wrote:

Per @djtodoro's suggestion, avoid emitting DW_AT_call_pc when tuning for gdb. We already emit AT_call_return_pc in this case, which seems weird/wrong to me, but that's what gdb expects to see, it seems.

Does it work for GDB?

Yes, @djtodoro's second-to-last comment shows an example where gdb works out the address for an artificial frame using the "return pc".

If so, how? Please be sure there's clear documentation (probably in a somewhat long explanatory comment in the LLVM source itself) about why this divergence is justified

I left a comment where DW_AT_call_pc is emitted explaining the situation: 'GDB prefers to work out what the call pc is by subtracting an offset from DW_AT_low_pc/DW_AT_call_return_pc'. I'll expand on this.

Should we do that only in DWARFv4, and leave DWARFv5 standard & let GDB implement the missing functionality to understand the standard form?

Sure, I think this is a nice opportunity to reduce debugger-specific divergence. I'll include that in the next update.

does GDB need this for providing some functionality, but less than is possible with DW_AT_call_pc, so LLDB wants that so it can provide the better experience, but providing only that would mean GDB would provide a worse experience than if it has return_pc?

gdb works backwards from non-standard usage of DW_AT_low_pc/DW_AT_call_return_pc at tail-call site entries to figure out the PC of tail-calling branch instructions. This means it doesn't need the compiler to emit DW_AT_call_pc, so we don't emit it (this change happened in my last patch update -- incidentally this means this patch no longer has any effect on debug info emission when tuning for gdb).

The language in the DWARF spec is, as always, pretty vague/general. But, yes, the call_pc wording in the spec (which includes a mention of jumps/tail calls) seems to be in contrast to the call_return_pc that only mentions calls, so I get where you're coming from.

However, there isn't a good reason to tie non-gdb debuggers to this non-standardness, as it adds unnecessary complexity to the debugger. It forces the debugger to disassemble around the fake "return pc" to find the actual tail-calling branch. In the case where the tail-calling branch is the last instruction in the function, I'm not sure how that would work, as the fake "return pc" wouldn't necessarily point anywhere meaningful. To side-step that, we emit DW_AT_call_pc for non-gdb debuggers.

the "I'm not sure how that would work" bit is sort of concerning to me - it does work for GDB, right? (I'd guess it points to the end of the instruction (same as the "high_pc" of a function points to the end of (or, one passed the end in both cases) of the function))

I was hedging a bit because I haven't/can't read the gdb sources. I expect that the implementation does something to the effect of:

tail_call_pc = call_return_pc-1;
tail_call_pc -= sizeOfInstAt(tail_call_pc)

in which case handling a tail call at the end of a function shouldn't be problematic. Part of why I hedged is that I don't know what impact (if any) post-link function reordering tools might have on DW_AT_call_return_pc. Hypothetically, if the encoded address is one past the end of caller function, and the post-link tool fastidiously updates addresses in DWARF sections which point within moved functions, the non-standard trick gdb uses would stop working (otoh DW_AT_call_pc would get updated correctly).
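
Spelling that out as a self-contained, linear-sweep variant of the same guess (pure conjecture about consumer-side behavior; instLen is a hypothetical disassembler helper):

#include <cstdint>

uint64_t instLen(uint64_t pc); // hypothetical: size of the instruction at pc

// Walk forward from the function entry until an instruction ends exactly
// at the recorded "return pc"; that instruction's start is the branch.
uint64_t findTailCallPC(uint64_t funcStart, uint64_t callReturnPC) {
  for (uint64_t pc = funcStart; pc < callReturnPC;) {
    uint64_t len = instLen(pc);
    if (len == 0)
      break;                      // defensive: undecodable byte
    if (pc + len == callReturnPC)
      return pc;                  // the tail-calling branch starts here
    pc += len;
  }
  return 0; // no instruction ends at callReturnPC
}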

I'm not going to/don't mean to hold any of this work up, just curious.

I guess if GDB does add support for call_pc, it'll likely keep its extension support of call_return_pc as well so it'll keep working with the non-standard emission for GDB you're proposing, and then we can fix-forward & remove this workaround. Has someone filed the bug on GDB for this missing functionality?

I did a quick scan of the gdb bug database but couldn't find anything relevant. If anyone following along can confirm gdb hasn't added support for DW_AT_call_pc, I'd appreciate it if you could file a bug.

vsk updated this revision to Diff 251805.Mar 20 2020, 5:07 PM

Add more detailed comments, and always emit DW_AT_call_pc in DWARF5 mode (even when tuning for GDB).

does GDB need this for providing some functionality, but less than is possible with DW_AT_call_pc, so LLDB wants that so it can provide the better experience, but providing only that would mean GDB would provide a worse experience than if it has return_pc?

gdb works backwards from non-standard usage of DW_AT_low_pc/DW_AT_call_return_pc at tail-call site entries to figure out the PC of tail-calling branch instructions. This means it doesn't need the compiler to emit DW_AT_call_pc, so we don't emit it (this change happened in my last patch update -- incidentally this means this patch no longer has any effect on debug info emission when tuning for gdb).

The language in the DWARF spec is, as always, pretty vague/general. But, yes, the call_pc wording in the spec (which includes a mention of jumps/tail calls) seems to be in contrast to the call_return_pc that only mentions calls, so I get where you're coming from.

However, there isn't a good reason to tie non-gdb debuggers to this non-standardness, as it adds unnecessary complexity to the debugger. It forces the debugger to disassemble around the fake "return pc" to find the actual tail-calling branch. In the case where the tail-calling branch is the last instruction in the function, I'm not sure how that would work, as the fake "return pc" wouldn't necessarily point anywhere meaningful. To side-step that, we emit DW_AT_call_pc for non-gdb debuggers.

the "I'm not sure how that would work" bit is sort of concerning to me - it does work for GDB, right? (I'd guess it points to the end of the instruction (same as the "high_pc" of a function points to the end of (or, one passed the end in both cases) of the function))

I was hedging a bit because I haven't/can't read the gdb sources. I expect that the implementation does something to the effect of:

tail_call_pc = call_return_pc-1;
tail_call_pc -= sizeOfInstAt(tail_call_pc)

in which case handling a tail call at the end of a function shouldn't be problematic. Part of why I hedged is that I don't know what impact (if any) post-link function reordering tools might have on DW_AT_call_return_pc. Hypothetically, if the encoded address is one past the end of caller function, and the post-link tool fastidiously updates addresses in DWARF sections which point within moved functions, the non-standard trick gdb uses would stop working (otoh DW_AT_call_pc would get updated correctly).

An implementation couldn't resolve addresses like that, or it'd break the normal DW_AT_high_pc, which also points to the same address (one past the end of the function).

In D76336#1934828, @vsk wrote:

Add more detailed comments, and always emit DW_AT_call_pc in DWARF5 mode (even when tuning for GDB).

I like this. Actually, I was in favor of this option.

llvm/lib/CodeGen/AsmPrinter/DwarfDebug.cpp
885–886

I still think we can guard the PCAddr here with (IsTail && !useGNUAnalogForDwarf5Feature(DD)), if we want to use the call_pc for gdb + dwarf5 tuning?
And also update the comment above.

vsk updated this revision to Diff 252135.Mar 23 2020, 1:59 PM
vsk marked 2 inline comments as done.

Address review feedback.

llvm/lib/CodeGen/AsmPrinter/DwarfDebug.cpp
885–886

Ah right, sorry I missed this.

djtodoro accepted this revision.Mar 24 2020, 7:40 AM

Thanks! lgtm (from my side)

This revision is now accepted and ready to land.Mar 24 2020, 7:40 AM
This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project. Mar 24 2020, 12:22 PM