Page MenuHomePhabricator

[Coroutines] Add missing llvm.dbg.declare's to cover more allocas
ClosedPublic

Authored by bruno on Nov 4 2020, 10:00 AM.

Details

Summary

Tracking local variables across suspend points is still somewhat incomplete. Consider this coroutine:

// Complete code here: https://gist.github.com/bcardosolopes/a992950bdfc66ab4ce6dc0c75920f4ef
resumable foo() {
  int x[10] = {};
  int a = 3;
  co_await std::experimental::suspend_always();
  a++;
  x[0] = 1;
  a += 2;
  x[1] = 2;
  a += 3;
  x[2] = 3;
}

Can't manage to print a or x if they turn out to be allocas during CoroSplit (which happens if you build this code with -O0 against ToT):

* thread #1, queue = 'com.apple.main-thread', stop reason = step over
    frame #0: 0x0000000100003729 main-noprint`foo() at main-noprint.cpp:43:5
   40     co_await std::experimental::suspend_always();
   41     a++;
   42     x[0] = 1;
-> 43     a += 2;
   44     x[1] = 2;
   45     a += 3;
   46     x[2] = 3;
(lldb) p x
error: <user expression 21>:1:1: use of undeclared identifier 'x'
x
^

The generated IR contains a llvm.dbg.declare for x in it's initialization basic block. However, even though this BB dominates all other BBs where x is manipulated, that doesn't seem to be enough debug info for the debugger to be happy. By adding extra llvm.dbg.declares in these BBs, lldb prints x successfully. Is this how llvm.dbg.declares supposed to work or am I missing something? Given the perceived behavior, this patch improves CoroSplit by placing extra llvm.dbg.declares in all basic blocks that need some "refresh" for the frame location to be found, so this:

await.ready:
  ...
  %arrayidx = getelementptr inbounds [10 x i32], [10 x i32]* %x.reload.addr, i64 0, i64 0, !dbg !760
  ...
  %arrayidx19 = getelementptr inbounds [10 x i32], [10 x i32]* %x.reload.addr, i64 0, i64 1, !dbg !763
  ...
  %arrayidx21 = getelementptr inbounds [10 x i32], [10 x i32]* %x.reload.addr, i64 0, i64 2, !dbg !766

becomes:

await.ready:
  ...
  call void @llvm.dbg.declare(metadata [10 x i32]* %x.reload.addr, metadata !751, metadata !DIExpression()), !dbg !753
  ...
  %arrayidx = getelementptr inbounds [10 x i32], [10 x i32]* %x.reload.addr, i64 0, i64 0, !dbg !760
  ...
  %arrayidx19 = getelementptr inbounds [10 x i32], [10 x i32]* %x.reload.addr, i64 0, i64 1, !dbg !763
  ...
  %arrayidx21 = getelementptr inbounds [10 x i32], [10 x i32]* %x.reload.addr, i64 0, i64 2, !dbg !766

For additional context, this builds up on top of changes from D75338 back in Feb. I also plan to add a LLDB end-to-end test for coroutines in a followup patch once this is fixed.

Diff Detail

Event Timeline

bruno created this revision.Nov 4 2020, 10:00 AM
Herald added a project: Restricted Project. · View Herald TranscriptNov 4 2020, 10:00 AM
bruno requested review of this revision.Nov 4 2020, 10:00 AM

Thank you for working on this!

llvm/lib/Transforms/Coroutines/CoroFrame.cpp
1226

Do we need to insert to every BB that uses it though? Though this may be the safest way to guarantee there is at least one, so I don't object doing this.
Also want to point out this: https://llvm.org/docs/SourceLevelDebugging.html#llvm-dbg-declare
it says there can only be one dbg.declare. However in practice I think as long as they all look the same it should be fine.

davide removed a subscriber: davide.Nov 4 2020, 11:38 AM
bruno added inline comments.Nov 4 2020, 12:00 PM
llvm/lib/Transforms/Coroutines/CoroFrame.cpp
1226

Do we need to insert to every BB that uses it though? Though this may be the safest way to guarantee there is at least one, so I don't object doing this.

That's a good question, I initially thought a dbg.declare dominating the BB in question would be enough, perhaps the debugging experts could help clarify what's going on.

Also want to point out this: https://llvm.org/docs/SourceLevelDebugging.html#llvm-dbg-declare
it says there can only be one dbg.declare. However in practice I think as long as they all look the same it should be fine.

Yep, I also saw this but didn't get any verifier complains (perhaps because they look the same), guess we have another question for the experts :)

bruno wrote:

That's a good question, I initially thought a dbg.declare dominating the BB in question would be enough, perhaps the debugging experts could help clarify what's going on.

Hrrrmmmm. Just to confirm my understanding, in tests like the one modified in this patch, all those allocas (%i, %j, %x) are transformed into GEPs from FramePtr, which is malloc'd / new'd or otherwise not on the stack?

Normally dbg.declare is supposed to refer to a static alloca -- that way we can easily identify stack variables and not bother tracking them through optimisations. Sometimes this isn't true though (certain ABIs pass in memory for return values apparently), in which case the dbg.declare gets silently transformed into something like dbg.value(%0, [...], DW_OP_deref), which is what seems to happen here.

That should be fine: however as you've already spotted, the debug intrinsic needs to dominate the instructions that receive the variable location, and that doesn't seem to be the case in the given example. Here's a gist with the input test.cpp at the bottom, and IR for foo.resume on IR line 693 onwards. I've compiled this with a ~6 week old clang master and -emit-llvm -S -g -O0 -fcoroutines-ts -stdlib=libc++. I'm not familiar with the coroutine implementation details and assumed:

  • foo.resume is the function we care about as it's the first function gdb stops in when I set breakpoints in the "foo" function,
  • The await.ready block is the main part of the foo.ready function, and everything else is coroutine plumbing.

From the entry block, there's a path through the blocks thus:

resume.entry
resume.1
resume.1.landing
AfterCoroSuspend50
await.ready

Which reaches the await.ready block without going through any debug intrinsics. The variable locations won't be propagated into the await.ready block because of this path, as there's no information about variable locations on it. This matches what dwarfdump reports: there's a variable location for _some_ of the function, but not all of it, and not the part that the developer wants to step through.

I'm not familiar with the coroutine implementation details, glad to be corrected. (CC @Orlando @chrisjackson , this is an example of where redesigned intrinsics would definitely need to refer to variables in an arbitrary memory location, ouch).

bruno added a comment.Nov 5 2020, 6:56 PM

Hi Jeremy, thanks for taking a deep look here.

Hrrrmmmm. Just to confirm my understanding, in tests like the one modified in this patch, all those allocas (%i, %j, %x) are transformed into GEPs from FramePtr, which is malloc'd / new'd or otherwise not on the stack?

That's right.

Normally dbg.declare is supposed to refer to a static alloca -- that way we can easily identify stack variables and not bother tracking them through optimisations. Sometimes this isn't true though (certain ABIs pass in memory for return values apparently), in which case the dbg.declare gets silently transformed into something like dbg.value(%0, [...], DW_OP_deref), which is what seems to happen here.

Can you clarify what you mean by "silently transformed into something like dbg.value..."? In what stage do you see that happening? Those don't show up for me unless the local variables are already promoted to regs.

That should be fine: however as you've already spotted, the debug intrinsic needs to dominate the instructions that receive the variable location, and that doesn't seem to be the case in the given example. Here's a gist with the input test.cpp at the bottom, and IR for foo.resume on IR line 693 onwards. I've compiled this with a ~6 week old clang master and -emit-llvm -S -g -O0 -fcoroutines-ts -stdlib=libc++. I'm not familiar with the coroutine implementation details and assumed:

  • foo.resume is the function we care about as it's the first function gdb stops in when I set breakpoints in the "foo" function,
  • The await.ready block is the main part of the foo.ready function, and everything else is coroutine plumbing.

Yep.

From the entry block, there's a path through the blocks thus:

resume.entry
resume.1
resume.1.landing
AfterCoroSuspend50
await.ready

Which reaches the await.ready block without going through any debug intrinsics. The variable locations won't be propagated into the await.ready block because of this path, as there's no information about variable locations on it. This matches what dwarfdump reports: there's a variable location for _some_ of the function, but not all of it, and not the part that the developer wants to step through.

If you look at the final IR, this is true and it actually confirms why it wasn't working before, thanks! However, at the point where insertSpills is called and the logic on this patch runs, there's no resume function just yet, since it's operating on the original foo, and in that context init.ready (which already has a dbg.declares) still dominates await.ready.

Given that, another approach would be to redo this logic on top of the already formed resume function and only re-insert the dbg.declares in BBs that are not dominated. One possible drawback of this approach is that it would require applying the same logic to init, resume and destroy and going for another CFG walk for each alloca spill. Doesn't sound super expensive but the convenience on doing it in the original function instead sounds compelling. What's the impact of having reductant dbg.declares in already dominated paths? Is there a pass that cleans those up? Thoughts?

I'm not familiar with the coroutine implementation details, glad to be corrected. (CC @Orlando @chrisjackson , this is an example of where redesigned intrinsics would definitely need to refer to variables in an arbitrary memory location, ouch).

This patch works for llvm.dbg.addr. So what is status of llvm.dbg.addr? does anyone know about it ?

bruno added a comment.Nov 6 2020, 10:59 AM

This patch works for llvm.dbg.addr. So what is status of llvm.dbg.addr? does anyone know about it ?

Note sure I follow your comment, did you mean "works for llvm.dbg.declare, what about llvm.dbg.addr"? My quick code search here seems like the only way to get llvm.dbg.addrs are via the llvm option -use-dbg-addr. This option doesn't seem to be used by clang (maybe some other frontends do) and using it means changing getDeclareIntrin() to give you .addrs instead of .declare, so I assume this change should work for both?

Hi Bruno,

Can you clarify what you mean by "silently transformed into something like dbg.value..."? In what stage do you see that happening? Those don't show up for me unless the local variables are already promoted to regs.

During instruction selection, so quite late: if you run "llc -stop-before=finalize-isel" on some IR that contains a dbg.declare, if it refers to a stack slot then the variable location is attached to the stack slot:

stack:
  - { id: 0, name: __x.addr, type: default, offset: 0, size: 8, alignment : 8,
      stack-id: default, callee-saved-register: '', callee-saved-restored : true,
      debug-info-variable: '!692', debug-info-expression: '!DIExpression()',
      debug-info-location: '!693' }

However if the dbg.declare can't be tracked back to a stack slot, it becomes a "DBG_VALUE" machine instruction (equivalent to a dbg.value intrinsic):

DBG_VALUE %26, 0, !704, !DIExpression(), debug-location !706

The former doesn't need to worry about control flow because the variable is always homed in a stack slot; the latter does need to worry about control flow. I imagine that everything to do with coroutines will take the latter path.

What's the impact of having reductant dbg.declares in already dominated paths? Is there a pass that cleans those up? Thoughts?

Zero functional change, and a tiny performance cost from having one extra metadata instruction hanging around. If you go down this route, I'd recommend using the dbg.value intrinsic with a DIExpression with a single DW_OP_deref -- this is effectively what dbg.declare becomes as shown above, and avoids any unexpected surprises if it turns out some code somewhere really does expect only one dbg.declare to exist.

junparser wrote:

This patch works for llvm.dbg.addr. So what is status of llvm.dbg.addr? does anyone know about it ?

In theory this kind of behaviour is exactly what dbg.addr is for (variable lives in memory, maybe changes due to control flow). However moving everything to use dbg.addr stalled a long time ago, I believe @Orlando found that it's often unexpectedly dropped by optimisations, any opinion on whether it's usable @Orlando? We're hoping to use it for things someday soon, but we're definitely not there yet.

bruno added a comment.Nov 6 2020, 3:31 PM

Hi Jeremy,

However if the dbg.declare can't be tracked back to a stack slot, it becomes a "DBG_VALUE" machine instruction (equivalent to a dbg.value intrinsic):

DBG_VALUE %26, 0, !704, !DIExpression(), debug-location !706

The former doesn't need to worry about control flow because the variable is always homed in a stack slot; the latter does need to worry about control flow. I imagine that everything to do with coroutines will take the latter path.

I see now, thanks for sharing!

What's the impact of having reductant dbg.declares in already dominated paths? Is there a pass that cleans those up? Thoughts?

Zero functional change, and a tiny performance cost from having one extra metadata instruction hanging around. If you go down this route, I'd recommend using the dbg.value intrinsic with a DIExpression with a single DW_OP_deref -- this is effectively what dbg.declare becomes as shown above, and avoids any unexpected surprises if it turns out some code somewhere really does expect only one dbg.declare to exist.

Gotcha, I like your recommendation, let's prevent unexpected surprises. I've tested the approach and works the same in the final debugging experience. Will update the patch.

bruno updated this revision to Diff 303595.Nov 6 2020, 5:17 PM

Update comments and now use dbg.values after @jmorse feedback.

lxfind added a comment.Nov 6 2020, 6:41 PM

Could you add a test (or update existing tests) to demonstrate that the issue is fixed?

Jeremy said:

In theory this kind of behaviour is exactly what dbg.addr is for (variable lives in memory, maybe changes due to control flow). However moving everything to use dbg.addr stalled a long time ago, I believe @Orlando found that it's often unexpectedly dropped by optimisations, any opinion on whether it's usable @Orlando? We're hoping to use it for things someday soon, but we're definitely not there yet.

Yes that's right. It wasn't a particularly deep dive but from recent experience I would advise avoiding dbg.addr for now.

bruno added a comment.Nov 9 2020, 11:26 AM

Could you add a test (or update existing tests) to demonstrate that the issue is fixed?

The testcase was updated as part of the last diff update, anything specific you are looking for?

Could you add a test (or update existing tests) to demonstrate that the issue is fixed?

The testcase was updated as part of the last diff update, anything specific you are looking for?

Sorry never mind. I wasn't looking at it correctly.

lxfind accepted this revision.Nov 9 2020, 5:25 PM

LGTM. Thanks!

This revision is now accepted and ready to land.Nov 9 2020, 5:25 PM

Hi Bruno,
One of the another issue is that it is also necessary to track coroutine function parameters correctly under O1/O2 level which use dbg.value. Thoughts?

bruno added a comment.Nov 9 2020, 6:18 PM

Hi @junparser,

One of the another issue is that it is also necessary to track coroutine function parameters correctly under O1/O2 level which use dbg.value. Thoughts?

I haven't looked up the coroutine function parameter debugging quality yet, does it even work fine for -O0? I'm trying to go over the issues I find for -O0 first and incrementally improve on top of that. Any specific testcase in mind?

Hi @junparser,

One of the another issue is that it is also necessary to track coroutine function parameters correctly under O1/O2 level which use dbg.value. Thoughts?

I haven't looked up the coroutine function parameter debugging quality yet, does it even work fine for -O0? I'm trying to go over the issues I find for -O0 first and incrementally improve on top of that. Any specific testcase in mind?

I'm not sure about it, but it should not work fine for O1 above. I do not have testcase right now, you can just change your testcase to add some parameters and use them.