This is an archive of the discontinued LLVM Phabricator instance.

[Docs][RFC] Add AMDGPU LLVM Extensions for Heterogeneous Debugging
Needs ReviewPublic

Authored by scott.linder on Nov 28 2022, 2:58 PM.

Details

Summary

Add document which introduces, motivates, and defines debug info
extensions designed to support heterogeneous compute and improve debug
information for all targets.

An accompanying RFC, along with an implementation for AMDGPU with
optimizations disabled will follow.

Diff Detail

Event Timeline

scott.linder created this revision.Nov 28 2022, 2:58 PM
Herald added a project: Restricted Project. · View Herald TranscriptNov 28 2022, 2:58 PM
scott.linder requested review of this revision.Nov 28 2022, 2:58 PM
Herald added a project: Restricted Project. · View Herald TranscriptNov 28 2022, 2:58 PM

This is a very large document and I admit I've only briefly skimmed it. But I wonder whether the redesign of DIExpr in terms of new DIOpXyz (instead of DW_OP_Xyz) is introducing too many trees to let the forest be visible. Finding a way to split this up into more digestible and reviewable chunks would be really wonderful. The DIExpr redesign in particular (I'd expect) would be incredibly intrusive to implement, and I think you'd need to demonstrate that this redesign is absolutely necessary (cannot be implemented with the existing DWARF-like expressions perhaps augmented with some new DW_OP_LLVM_yyy operators).

Based on the Dev Meeting talk, the explicit notions of fragment and lifetime sounded appealing, and worth looking at independent of any "heterogeneous" aspect. AMDGPU's needs might have been the motivation, but the problems addressed by these notions are generic. I'd expect these solutions should be broadly applicable. It's especially exciting to see ways forward to support multiple concurrent locations, something that already comes up as a problem and would greatly benefit the debugging experience even in a scalar or homogeneous environment.

llvm/docs/AMDGPULLVMExtensionsForHeterogeneousDebugging.rst
44

I wouldn't go that far. LLVM does in fact promise to be able to read older versions of bitcode, necessarily including older versions of debug-info metadata. so at the very least you'd need to be able to auto-upgrade older metadata into the new format.

scott.linder added a comment.EditedNov 29 2022, 12:31 PM

This is a very large document and I admit I've only briefly skimmed it. But I wonder whether the redesign of DIExpr in terms of new DIOpXyz (instead of DW_OP_Xyz) is introducing too many trees to let the forest be visible. Finding a way to split this up into more digestible and reviewable chunks would be really wonderful. The DIExpr redesign in particular (I'd expect) would be incredibly intrusive to implement, and I think you'd need to demonstrate that this redesign is absolutely necessary (cannot be implemented with the existing DWARF-like expressions perhaps augmented with some new DW_OP_LLVM_yyy operators).

This is a great point, and one that we considered before we decided to design a new expression node. I will be posting actual code which implements these things in an incremental fashion, so hopefully it will make it easier to reason about and relate to what we have, but I will also try to defend the approach here (sorry for the continued verbosity, I'm trying to keep things short without losing any nuance):

One reason for the switch is to enable a more type-safe, extensible, and ergonomic interface to the expressions. As it stands, there is no clean abstraction for the expression: it is transparently just a std::vector<uint64_t>. There are some methods on DIExpression and some free functions in DebugInfoMetadata.h that can more-or-less spare the developer from having to deal directly in this representation, but it is leaky abstraction.

Even if we discount the benefits in ergonomics, there is still an issue with composability: because the fundamental unit of abstraction is a DIExpression * and we can only operate on it via methods and free functions, those must return a fully uniqued DIExpression *, even if the user intends to compose many operations to arrive at the final expression they want. For example, in PrologEpilogInserter.cpp:

unsigned PrependFlags = DIExpression::ApplyOffset;
if (!MI.isIndirectDebugValue() && !DIExpr->isComplex())
  PrependFlags |= DIExpression::StackValue;
if (MI.isIndirectDebugValue() && DIExpr->isImplicit()) {
  SmallVector<uint64_t, 2> Ops = {dwarf::DW_OP_deref_size, Size};
  bool WithStackValue = true;
  DIExpr = DIExpression::prependOpcodes(DIExpr, Ops, WithStackValue);
  // Make the DBG_VALUE direct.
  MI.getDebugOffset().ChangeToRegister(0, false);
}
DIExpr = TRI.prependOffsetExpression(DIExpr, PrependFlags, Offset);

The intermediate DIExpression * returned from the call to prependOpcodes goes through the whole uniqueing mechanism and is then immediately discarded once used as input to prependOffsetExpression. The caller (who just wants to write their pass, not fiddle with debug info) either needs to go factor out the implementation of these two methods into something that operates on std::vector<uint64_t>, and then combine them into a new prependOpcodesAndThenPrependOffsetExpression, or else expose the clunky std::vector<uint64_t> representation directly in their pass code. I don't think either solution is acceptable, which means we just have objectively worse code every time. For example, this code in TargetInstrInfo.cpp deals with std::vector<uint64_t> to avoid the performance cost:

SmallVector<uint64_t, 8> Ops;
DIExpression::appendOffset(Ops, Offset);
Ops.push_back(dwarf::DW_OP_deref_size);
Ops.push_back(MMO->getSize());
Expr = DIExpression::prependOpcodes(Expr, Ops);
return ParamLoadedValue(*BaseOp, Expr);

The rationale for which methods are written at which level of abstraction is unclear, and at best we would essentially need two versions of every method.

Our solution was to introduce an abstraction for both:

  • An "at rest", uniqued expression (DIExpr)
  • A "mutable" expression builder (DIExprBuilder)

Currently we have the same underlying representation for both, so going from DIExpr to DIExprBuilder is a memcpy and going from DIExprBuilder to DIExpr is a pointer copy, but we believe this leaves us the room to improve e.g. the compactness of the "at rest" representation at the expense of some extra work to interconvert, and we will be able to do this without changing any client code.

Once we arrived at this design, we realized that even if we stuck with DIExpression and added a DIExpressionBuilder we needed to disrupt nearly all code which deals with expressions to use the new interfaces. At that point, we decided it was worth also striving for the ergonomic benefits of doing away with the std::vector<uint64_t> completely and using a std::variant-like interface.

Based on the Dev Meeting talk, the explicit notions of fragment and lifetime sounded appealing, and worth looking at independent of any "heterogeneous" aspect. AMDGPU's needs might have been the motivation, but the problems addressed by these notions are generic. I'd expect these solutions should be broadly applicable. It's especially exciting to see ways forward to support multiple concurrent locations, something that already comes up as a problem and would greatly benefit the debugging experience even in a scalar or homogeneous environment.

Yes, we agree 100%, and the "heterogeneous" naming has only persisted because we haven't divined a better name for the work as a whole. Suggestions are welcome!

llvm/docs/AMDGPULLVMExtensionsForHeterogeneousDebugging.rst
44

It may be too strong, but the idea was that it is currently always "correct" to drop debug information. This is probably something that warrants a broader discussion, as that is admittedly a pretty unhelpful approach for anyone actually relying on these upgrades for anything with debug information, but I don't see the use case for upgrading debug info in bitcode at rest.

To be more precise, the situations I see are:

  • Source is available, in which case the next run of the front-end "upgrades" you to the new debug info metadata; no compatibility concerns.
  • Source is not available, in which case I don't immediately see what "debug info" would even be describing? Are there examples of bitcode maintained directly (i.e. "by hand") that also includes debug information? Is that debug information describing some hypothetical source language? Is it describing LLVM bitcode itself, as that is the source language?

Again, I don't doubt that there are cases where the upgrade is needed, I just can't identify them myself.

I accidentally posted while still working on a draft comment, so sorry if you saw a partial version!

Also I wanted to say thank you for attending the talk, and for following up here; I think there is still a lot of room to improve the proposal, and I will try to take your suggestion of splitting things and come back with a few smaller independent chunks. I already have most of the code broken up into smaller, self-contained pieces, so that may also help once I have those rebased and on Phabricator.

the "heterogeneous" naming has only persisted because we haven't divined a better name for the work as a whole

Yeah, "variable location tracking debug-info metadata redesign" isn't too snappy.

it is currently always "correct" to drop debug information. This is probably something that warrants a broader discussion, as that is admittedly a pretty unhelpful approach for anyone actually relying on these upgrades for anything with debug information,

That has been the historical attitude, which I blame on the academic roots of the project lo these many years ago, and is totally inappropriate for an industry product used by huge numbers of people for their daily work.

but I don't see the use case for upgrading debug info in bitcode at rest.

Your examples look like they're based on using debug info for source-level debugging by the end user. The vast majority of end users don't do any debugging, but there are lots of cases where what they're getting from the vendor is bitcode. Shipping bitcode libraries to facilitate LTO is a thing, and I can easily imagine a core dump coming back to the library provider, who will indeed have the source available. iOS apps are delivered to the app store as bitcode, and aren't necessarily re-delivered with every compiler update; yet an app crash surely wants to have source information reported back to the vendor if at all possible. Crash dump backtraces want debug info not just for source info but to allow symbolizing the trace, deducing parameters, and the like.

So, when Clang reads older bitcode, it definitely needs to upgrade debug info metadata on the fly.

jryans added a subscriber: jryans.Nov 30 2022, 9:35 AM

iOS apps are delivered to the app store as bitcode

I wanted to mention a small correction (just so you are aware)... The iOS app store no longer accepts bitcode as of Xcode 14 (see Deprecations section), so this particular route no longer applies as of 2022-09.

Anyway, that's just one use case, and you've already mentioned several others (crash dumps, LTO), so your general point still stands. I just wanted you to be aware of this recent change.

the "heterogeneous" naming has only persisted because we haven't divined a better name for the work as a whole

Yeah, "variable location tracking debug-info metadata redesign" isn't too snappy.

it is currently always "correct" to drop debug information. This is probably something that warrants a broader discussion, as that is admittedly a pretty unhelpful approach for anyone actually relying on these upgrades for anything with debug information,

That has been the historical attitude, which I blame on the academic roots of the project lo these many years ago, and is totally inappropriate for an industry product used by huge numbers of people for their daily work.

but I don't see the use case for upgrading debug info in bitcode at rest.

Your examples look like they're based on using debug info for source-level debugging by the end user. The vast majority of end users don't do any debugging, but there are lots of cases where what they're getting from the vendor is bitcode. Shipping bitcode libraries to facilitate LTO is a thing, and I can easily imagine a core dump coming back to the library provider, who will indeed have the source available. iOS apps are delivered to the app store as bitcode, and aren't necessarily re-delivered with every compiler update; yet an app crash surely wants to have source information reported back to the vendor if at all possible. Crash dump backtraces want debug info not just for source info but to allow symbolizing the trace, deducing parameters, and the like.

So, when Clang reads older bitcode, it definitely needs to upgrade debug info metadata on the fly.

Understood, I can update the document to match the reality, and start a section for the upgrade strategy. My initial feeling is that most things will be pretty straightforward to translate, we had just not considered it a requirement originally.

jmorse added a comment.Dec 2 2022, 4:19 AM

Hi Scott,

I really enjoyed the conference talk about this, and moving the issues of how variables are fragmented into smaller chunks to higher up in the metadata hierachy makes a lot of sense. It could substantially simplify + improve our tracking of variables today,

There's a lot of different things being re-engineered in this proposal, and I'd like to make sure that I have the correct understanding of how the current variable location design maps to this new one. As I understand it, dbg.values become dbg.def and dbg.kill intrinsics, connecting IR Values to DILifeime objects. The DILifetimes refer to a hierarchy of DIFragments that specify what's being defined (which is great), and the expression required to produce the variable value / location from the inputs.

After that it becomes fuzzier though: it's not obvious to me how the current variable value / location is determined when there can be (according to the document) multiple disjoint and overlapping lifetimes that are active. If a variable fragment has different runtime values, which one should we pick, and how -- or if they're supposed to always have the same value, what guarantees this during optimisation? Right now, dbg.value intrinsics are effectively an assignment to the variable [fragment], and the variable value is the last dominating dbg.value assignment (or possibly a PHI between multiple of them, determined by LiveDebugValues). What is the equivalent for these new intrinsics?

I think lifetimes and def/kill relationships makes sense after register allocation where that's the form the program is in, but it's not clear how it would work in SSA form. It's also worth noting that the multiple-locations-after-regalloc problem is solved, to a large extent, by the instruction referencing rewrite [0], essentially keeping the debugging information in SSA form and then recognising target COPY and value-movements to track the multiple locations a value can be resident.

There's value in having multiple ways of expressing variable locations, during loop-strength-reduction you can recompute a variable from the loop starting values or from the strength-reduced variables, for example. It needs to be approached with some delicacy though to save memory.

At a more abstract level, I've a worry that this might move us more in the direction of requiring more knowledge / maintenence during optimisation passes to preserve debug-info invariants, where it seems more beneficial to reduce that kind of maintenence, in compile and engineering time. It's certainly the motivation behind the assignment tracking work [1], which is inferring information about optimisations from what gets deleted rather than what gets preserved.

[0] https://www.youtube.com/watch?v=yxuwfNnp064
[1] https://discourse.llvm.org/t/rfc-assignment-tracking-a-better-way-of-specifying-variable-locations-in-ir/62367

llvm/docs/AMDGPULLVMExtensionsForHeterogeneousDebugging.rst
2041–2045

This is because redundant isn't trivially dead after this modification, causing the load to be CSE'd, which causes a RAUW of the Value that keeps the dbg.value alive. We could achieve the same results in the unmodified case by searching the function for equivalent Values whenever we delete trivially dead code and need to salvage variable locations, but it would be compile-time expensive.

2049–2073

Note that with the modification, a value is loaded from *bar and then stored back to *bar, which EarlyCSE successfully spots as being redundant, and deletes the heap store, which was the primary problem in PR40628.

Hi Scott,

I really enjoyed the conference talk about this, and moving the issues of how variables are fragmented into smaller chunks to higher up in the metadata hierachy makes a lot of sense. It could substantially simplify + improve our tracking of variables today,

There's a lot of different things being re-engineered in this proposal, and I'd like to make sure that I have the correct understanding of how the current variable location design maps to this new one. As I understand it, dbg.values become dbg.def and dbg.kill intrinsics, connecting IR Values to DILifeime objects. The DILifetimes refer to a hierarchy of DIFragments that specify what's being defined (which is great), and the expression required to produce the variable value / location from the inputs.

Yes, that all matches up with the proposal!

After that it becomes fuzzier though: it's not obvious to me how the current variable value / location is determined when there can be (according to the document) multiple disjoint and overlapping lifetimes that are active. If a variable fragment has different runtime values, which one should we pick, and how -- or if they're supposed to always have the same value, what guarantees this during optimisation? Right now, dbg.value intrinsics are effectively an assignment to the variable [fragment], and the variable value is the last dominating dbg.value assignment (or possibly a PHI between multiple of them, determined by LiveDebugValues). What is the equivalent for these new intrinsics?

There are actually multiple locations at runtime; as you say, the compiler must guarantee they contain the same value at runtime, so the debug info consumer can read from any location. However, the debug info consumer must write to each location.

Instead of intrinsics acting as assignments to a mutable, singleton "variable location" they instead each act independently and must refer to a distinct "lifetime" (DILifetime). If in the old world there are 4 calls to dbg.value for a single variable, the new version would instead create 4 DILifetimes and replace each dbg.value with a pair of non-overlapping dbg.def+dbg.kill

I think lifetimes and def/kill relationships makes sense after register allocation where that's the form the program is in, but it's not clear how it would work in SSA form. It's also worth noting that the multiple-locations-after-regalloc problem is solved, to a large extent, by the instruction referencing rewrite [0], essentially keeping the debugging information in SSA form and then recognising target COPY and value-movements to track the multiple locations a value can be resident.

Even in SSA an llvm::Value may only coincide with the source variable for part of its existence. In the old model this is represented by having multiple calls to dbg.value which refer to the same source variable. What about the def/kill representation seems like it won't work in SSA form?

I'm not sure I understand the relationship between the instr-ref work and multiple-locations; it seems to me that it still only leaves us with one machine location per variable at any given position in the program, or am I not understanding something?

There's value in having multiple ways of expressing variable locations, during loop-strength-reduction you can recompute a variable from the loop starting values or from the strength-reduced variables, for example. It needs to be approached with some delicacy though to save memory.

At a more abstract level, I've a worry that this might move us more in the direction of requiring more knowledge / maintenence during optimisation passes to preserve debug-info invariants, where it seems more beneficial to reduce that kind of maintenence, in compile and engineering time. It's certainly the motivation behind the assignment tracking work [1], which is inferring information about optimisations from what gets deleted rather than what gets preserved.

I need to spend a bit more time going through the assignment-tracking work to form a better response, but the principle of reducing the work required in the vast majority of passes while maintaining meaningful and accurate debug information sounds great!

[0] https://www.youtube.com/watch?v=yxuwfNnp064
[1] https://discourse.llvm.org/t/rfc-assignment-tracking-a-better-way-of-specifying-variable-locations-in-ir/62367

There are actually multiple locations at runtime; as you say, the compiler must guarantee they contain the same value at runtime, so the debug info consumer can read from any location. However, the debug info consumer must write to each location.

Right, that's how multiple-locations has to work. DWARF already allows this, it's a matter of persuading LLVM to understand multiple concurrent locations and emit the DWARF accordingly. And getting the debuggers to DTRT.

Instead of intrinsics acting as assignments to a mutable, singleton "variable location" they instead each act independently and must refer to a distinct "lifetime" (DILifetime). If in the old world there are 4 calls to dbg.value for a single variable, the new version would instead create 4 DILifetimes and replace each dbg.value with a pair of non-overlapping dbg.def+dbg.kill

...but I don't see how non-overlapping lifetimes gets us to multiple concurrent locations.

Instead of intrinsics acting as assignments to a mutable, singleton "variable location" they instead each act independently and must refer to a distinct "lifetime" (DILifetime). If in the old world there are 4 calls to dbg.value for a single variable, the new version would instead create 4 DILifetimes and replace each dbg.value with a pair of non-overlapping dbg.def+dbg.kill

...but I don't see how non-overlapping lifetimes gets us to multiple concurrent locations.

I agree they do not, I meant to reply specifically the question of how one would represent the current case where "dbg.value intrinsics are effectively an assignment to the variable [fragment], and the variable value is the last dominating dbg.value assignment". That is, if there is truly only one active location, then there is only one active DILifetime. If that location changes throughout (i.e. in the old scheme there are multiple intrinsics) then in the new scheme there will be multiple non-overlapping lifetimes.

Also, there may be some delay before I update the review at all, as I'm trying to reconcile the more constrained multiple-location support that is a part of the DIAssignID work with the more general approach we took in this review. I see the great appeal in exploiting regularity in the kinds of locations LLVM actually encounters. If I understand the approach, it relies on the observation that a variable typically has some "home" alloca, and many other "vacation" Values, between which it moves (at times occupying multiple) and exploiting this can simplify the work of pass writers and make most operations on IR "just work" in tracking which locations are valid for any given assignment.

StephenTozer added inline comments.Dec 8 2022, 3:32 AM
llvm/docs/AMDGPULLVMExtensionsForHeterogeneousDebugging.rst
2645–2649

I'm unsure what this part means - is it implying that this work gets equivalent results to the instruction referencing implementation and you're not sure why that is the case? I would have thought that in principal, the MIR form of this work would ideally use instruction references wherever possible to prevent lifetime ranges that should be non-overlapping from becoming awkwardly tangled up during CodeGen.

Orlando added a subscriber: Orlando.Dec 8 2022, 8:54 AM

Instead of intrinsics acting as assignments to a mutable, singleton "variable location" they instead each act independently and must refer to a distinct "lifetime" (DILifetime). If in the old world there are 4 calls to dbg.value for a single variable, the new version would instead create 4 DILifetimes and replace each dbg.value with a pair of non-overlapping dbg.def+dbg.kill

...but I don't see how non-overlapping lifetimes gets us to multiple concurrent locations.

I agree they do not, I meant to reply specifically the question of how one would represent the current case where "dbg.value intrinsics are effectively an assignment to the variable [fragment], and the variable value is the last dominating dbg.value assignment". That is, if there is truly only one active location, then there is only one active DILifetime. If that location changes throughout (i.e. in the old scheme there are multiple intrinsics) then in the new scheme there will be multiple non-overlapping lifetimes.

Also, there may be some delay before I update the review at all, as I'm trying to reconcile the more constrained multiple-location support that is a part of the DIAssignID work with the more general approach we took in this review. I see the great appeal in exploiting regularity in the kinds of locations LLVM actually encounters. If I understand the approach, it relies on the observation that a variable typically has some "home" alloca, and many other "vacation" Values, between which it moves (at times occupying multiple) and exploiting this can simplify the work of pass writers and make most operations on IR "just work" in tracking which locations are valid for any given assignment.

Hi @scott.linder, FWIW I'm the author of those Assignment Tracking patches (DIAssignID / dbg.assign stuff). I'm not commenting to review (yet!) - just wanted to chip in to say that your understanding is correct. Perhaps one important clarification is that, while the "home" and "vacation" locations are both tracked through IR optimisations, there's an analysis pass that runs before ISel that flattens this so that just one location is available for any given instruction range. In other words, we track multiple locations but the DWARF does not have overlapping entries in location lists. I'm happy to answer any questions as they come up.

(I really enjoy the "vacation" terminology for variables not in their "home" and wish I'd thought of it)