This is an archive of the discontinued LLVM Phabricator instance.

[CodeView] Enable debugging of captured variables within C++ lambdas
ClosedPublic

Authored by bwyma on Apr 9 2018, 6:31 AM.

Details

Summary

When emitting CodeView debug information, C++ lambda types need a unique identifier so Visual Studio can match the forward reference for the type to the definition. Currently they are all named "<unnamed-tag>". This patch, also necessary for D32498, will enable debugging of captured variables within lambdas using Visual Studio.

Previous LF_CLASS type record for a lambda looks like this:

Class (0x100A) {
  TypeLeafKind: LF_CLASS (0x1504)
  ...
  Name: main::<unnamed-tag>
}

After this change the type record looks like this:

Class (0x100A) {
  TypeLeafKind: LF_CLASS (0x1504)
  ...
  Name: main::<unnamed-tag>
  LinkageName: .?AV<lambda_0>@?0??main@@9@
}

Manufacturing a better "name" for the lambda object is not necessary to allow proper debugging of captured variables.

Diff Detail

Repository
rC Clang

Event Timeline

bwyma created this revision.Apr 9 2018, 6:31 AM
bwyma edited the summary of this revision. (Show Details)Apr 9 2018, 6:33 AM
rnk requested changes to this revision.Apr 10 2018, 10:34 AM

I don't think this is enough, this issue generalizes to more than just lambdas. The more general problem is this:

  • C++ classes with methods create a cycle in the type hierarchy (class refers to method, method refers to class)
  • Some C++ classes without names may not be externally visibible
  • CodeView requires class identifiers to enable this circular reference.

Here is a program that will need identifiers that your patch misses:

void foo() {
  struct {
      void bar() {}
  } obj;
  obj.bar();
}

I think we want to ask the question "does this class have any non-trivial methods" here, and if so, emit an identifier for codeview.

This revision now requires changes to proceed.Apr 10 2018, 10:34 AM
bwyma updated this revision to Diff 143719.Apr 24 2018, 5:54 AM

I updated the patch file to generalize the fix for more than just lambdas.

I think we want to ask the question "does this class have any non-trivial methods" here, and if so, emit an identifier for codeview.

I believe this is also insufficient because data member pointers also contain a class id referring to the forward reference. For example:

struct { int bar; } two = { 42 };
int decltype(two)::*ptr2unnamed = &decltype(two)::bar;

I added this example and Reid's example into the test case included with my patch.

rnk added inline comments.Apr 24 2018, 9:13 AM
llvm/tools/clang/lib/CodeGen/CGDebugInfo.cpp
805 ↗(On Diff #143719)

The indentation seems off which suggests the patch may contain tabs. Please check that there are none before submitting.

807 ↗(On Diff #143719)

This boolean condition is complicated. Can you find a way to break it up and add comments to clarify the intent? Basically, externally visible types with C++ manglings always need unique names. When emitting codeview, non-externally visible unnamed types also need unique names.

However, doesn't this conflict with the goal of D32498? At this point, Clang will put unique names on every unnamed type in C++ mode. That's fine with me, FWIW.

809 ↗(On Diff #143719)

My final concern is that the C++ mangled names we generate in this case are not necessarily unique across the whole program, which is what we mean when we say "unique name" in this context. Fixing that is out of scope for this patch, but we should file a bug for it. I guess here is a case where we'll have two unrelated types with the same identifier:

// a.cpp
namespace { struct { int x; } x; }
int getx() { return x.x; }
// b.cpp
namespace { struct { float y; } x; }
float gety() { return x.y; }

I think we're safe for types inside non-inline functions, because we mangle in the parent function as a scope, so no other TU will collide with that mangling. Only anonymous namespaces have this problem. MSVC addresses this by including a hash (MD5?) of the main source file path in the mangled name for the anonymous namespace.

llvm/tools/clang/test/CodeGenCXX/debug-info-codeview-unnamed.cpp
35 ↗(On Diff #143719)

We should have a test that in C mode, anonymous structs do not get identifiers. If we already have it, that's fine.

57 ↗(On Diff #143719)

Maybe "Non-externally visible named structures do not ..." to be specific.

Also, are you sure? I would expect the debugger to require a forward reference for these methods and data members, and how can that work without a unique name?

bwyma updated this revision to Diff 145673.May 8 2018, 5:53 AM

The indentation seems off which suggests the patch may contain tabs.

Fixed.

This boolean condition is complicated. Can you find a way to break it up and add comments ...

Fixed.

However, doesn't this conflict with the goal of D32498?

This patch fixes the identifier emission for C++ types. Patch D32498 is for C types.

My final concern is that the C++ mangled names we generate in this case are not necessarily unique ...

Your point is correct, the identifier could be more robust. But this patch still improves cases which can't be debugged currently.

We should have a test that in C mode, anonymous structs do not get identifiers.

I added a new test for your review.

Maybe "Non-externally visible named structures do not ..." to be specific.

Fixed.

I would expect the debugger to require a forward reference for these methods and data members ...

You are correct, and your suggestion matches what Visual Studio produces. I fixed the patch to match this behavior.

rnk added inline comments.May 10 2018, 3:04 PM
llvm/tools/clang/lib/CodeGen/CGDebugInfo.cpp
823–826 ↗(On Diff #145673)

These comments and method names don't make sense anymore, though. We can't actually produce unique names for internal types, it's impossible without some kind of per-TU GUID.

I think this code would be a lot less confusing if we rename this to something more mechanical like getTypeIdentifier. CodeView needs identifiers for all C++ types, even if the names aren't unique and might collide across TUs. For DWARF, we only use identifiers if we have a unique C++ mangled name. The comment should say something about that.

Does that make sense?

bwyma updated this revision to Diff 146612.May 14 2018, 8:05 AM

I think this code would be a lot less confusing if we rename this to something more mechanical like getTypeIdentifier

I changed the routine names, updated the call sites, and made the comment changes.
Please let me know if you like this better.

rnk accepted this revision.May 15 2018, 10:27 AM

lgtm, thanks!

This revision is now accepted and ready to land.May 15 2018, 10:27 AM
Closed by commit rC332975 (authored by bwyma). · Explain WhyMay 22 2018, 5:45 AM
This revision was automatically updated to reflect the committed changes.
dblaikie added inline comments.
lib/CodeGen/CGDebugInfo.cpp
812–814

Just came across this code today - it's /probably/ a problem for LTO. LLVM IR linking depends on the identifier to determine if two structs are the same for linkage purposes - so if an identifier is added for a non-linkage (local/not externally visible) type, LLVM will consider them to be the same even though they're unrelated:

namespace { struct t1 { int i; int j; }; }
t1 v1;
void *v3 = &v1;
namespace { struct t1 { int i; }; }
t1 v1;
void *v2 = &v1;
$ clang++-tot -emit-llvm -S a.cpp b.cpp -g -gcodeview  && ~/dev/llvm/build/default/bin/llvm-link -o ab.bc a.ll b.ll && ~/dev/llvm/build/default/bin/llvm-dis ab.bc && cat ab.ll | grep "\"t1\""
!8 = distinct !DICompositeType(tag: DW_TAG_structure_type, name: "t1", scope: !9, file: !3, line: 1, size: 64, flags: DIFlagTypePassByValue, elements: !10, identifier: "_ZTSN12_GLOBAL__N_12t1E")
$ clang++-tot -emit-llvm -S a.cpp b.cpp -g && ~/dev/llvm/build/default/bin/llvm-link -o ab.bc a.ll b.ll && ~/dev/llvm/build/default/bin/llvm-dis ab.bc && cat ab.ll | grep "\"t1\""
!8 = distinct !DICompositeType(tag: DW_TAG_structure_type, name: "t1", scope: !9, file: !3, line: 1, size: 64, flags: DIFlagTypePassByValue, elements: !10)
!21 = distinct !DICompositeType(tag: DW_TAG_structure_type, name: "t1", scope: !9, file: !17, line: 1, size: 32, flags: DIFlagTypePassByValue, elements: !22)

So in linking, now both a.cpp and b.cpp refer to a single t1 type (in this case, it looks like the first one - the 64 bit wide one).

If CodeView actually can't represent these two distinct types with the same name in the same object file, so be it? But this looks like it's likely to cause problems for consumers/users.

rnk added inline comments.Jan 11 2022, 11:28 AM
lib/CodeGen/CGDebugInfo.cpp
812–814

If you use the MSVC ABI, we will assign unique identifiers to these two structs:

!9 = distinct !DICompositeType(tag: DW_TAG_structure_type, name: "t1", scope: !10, file: !7, line: 1, size: 64, flags: DIFlagTypePassByValue, elements: !11, identifier: ".?AUt1@?A0xF964240C@@")

The A0xF964240C is set up here, and it is based on the source file path (the hash will only be unique when source file paths are unique across the build):
https://github.com/llvm/llvm-project/blob/main/clang/lib/AST/MicrosoftMangle.cpp#L464

// It's important to make the mangled names unique because, when CodeView
// debug info is in use, the debugger uses mangled type names to distinguish
// between otherwise identically named types in anonymous namespaces.

Maybe this should use a "is MSVC ABI" check instead, since that is what controls whether the identifier will be unique, and the unique-ness is what matters for LTO and PDBs.

rnk added inline comments.Jan 11 2022, 12:27 PM
lib/CodeGen/CGDebugInfo.cpp
812–814

After thinking about it some more, see the comment I added here: rG59320fc9d78237a5f9fdd6a5030d6554bb6976ce

// If the type is not externally visible, it should be unique to the current TU,
// and should not need an identifier to participate in type deduplication.
// However, when emitting CodeView, the format internally uses these
// unique type name identifers for references between debug info. For example,
// the method of a class in an anonymous namespace uses the identifer to refer
// to its parent class. The Microsoft C++ ABI attempts to provide unique names
// for such types, so when emitting CodeView, always use identifiers for C++
// types. This may create problems when attempting to emit CodeView when the MS
// C++ ABI is not in use.

I think type identifiers are pretty crucial for CodeView functionality. The debugger won't be able to link a method to its class without them. Therefore, I think it's better to leave this as it is. The bad experience of duplicate type identifiers is better than the lack of functionality from not emitting identifiers at all for non-externally visible types.

dblaikie added inline comments.Jan 11 2022, 5:51 PM
lib/CodeGen/CGDebugInfo.cpp
812–814

Yeah, thanks for explaining/adding the comment - that about covers it.

Hmm, do these identifiers just need to only be unique within a single object file to provide the name referencing mechanics? Perhaps we could generate them later to ensure they're unique - rather than risk collapsing them while linking?

(at least here's an obscure case where they seem to collide: Compiling the same file with different -D values, I guess the hash is only for the filename, not for other properties that might impact the output - so this ends up with two types with the same mangled name even on MSVC ABI:

namespace { struct t1 { int i;
#ifdef SECOND
  int j;
#endif
}; }
t1 v1;
void*
#ifdef SECOND
v2
#else
v3
#endif
= &v1;
clang++-tot -emit-llvm -S a.cpp -g -gcodeview  -target x86_64-windows && clang++-tot -emit-llvm -S a.cpp -DSECOND -g -gcodeview -o b.ll -target x86_64-windows && ~/dev/llvm/build/default/bin/llvm-link -o ab.bc a.ll b.ll && ~/dev/llvm/build/default/bin/llvm-dis ab.bc && cat ab.ll | grep "\"t1\""

)

bwyma added inline comments.Jan 25 2022, 8:29 AM
lib/CodeGen/CGDebugInfo.cpp
812–814

As Reid already noted, the unique id is required in CodeView to match type definitions to forward references. Unfortunately, the unique id must align across translation units. Compiling the same file multiple times with preprocessor conditionals will result in two different types with the same unique id. You'll see the same behavior with Visual Studio 2019.

dblaikie added inline comments.Jan 25 2022, 9:22 AM
lib/CodeGen/CGDebugInfo.cpp
812–814

I think what I was getting at is that in the case of mangled names for internal linkage things - presumably there's no ABI compatibility concern? It's not like something compiled with MSVC or another version of Clang has to use the same naming scheme for these local entities, so long as the naming scheme doesn't collide?

So at least for LTO if we generated the name later (in the CodeView backend) it could be more unique - an LTO build wouldn't collapse the two types together. It could still have an identifier (generated by the CodeView backend) that was used to reference decls and defs within that file.

rnk added inline comments.Jan 25 2022, 12:12 PM
lib/CodeGen/CGDebugInfo.cpp
812–814

Yes, we could calculate the unique identifier during LTO. However, that seems like a very marginal improvement in the final user experience since most users don't use LTO for incremental development, which is when type information is most useful.

On the other hand, if we want to establish the invariant that all DICompositeType identifiers are unique within a compile unit, we could probably go ahead and do that without impacting the user experience. We already have renaming logic like this for llvm::GlobalValues, and frontends are check for existing values of that name if they don't want renaming to occur.

dblaikie added inline comments.Jan 25 2022, 12:18 PM
lib/CodeGen/CGDebugInfo.cpp
812–814

if we want to establish the invariant that all DICompositeType identifiers are unique within a compile unit, we could probably go ahead and do that without impacting the user experience. We already have renaming logic like this for llvm::GlobalValues, and frontends are check for existing values of that name if they don't want renaming to occur.

That invariant exists - that's the original purpose of the identifiers - LLVM IR linking deduplicates on the basis of the identifier. So we can't/don't rename them (because the point is to deduplicate based on them) - we keep the first one we see and drop the rest.

Any idea what happens when non-LTO with MSVC and you end up with duplicate identifiers? Are both versions kept, but references end up ambiguous? Is the second version seen dropped as being a duplicate/unneeded?

If non-LTO MSVC behavior drops later instances of the identifier as duplicate, then at least the current LTO behavior is no worse than that - though there'd be some opportunity to make at least LTO better, but yeah, marginally beneficial.

rnk added inline comments.Jan 25 2022, 12:34 PM
lib/CodeGen/CGDebugInfo.cpp
812–814

Yes, the linker only retains one complete type definition for each unique type name, so you would get the same behavior with or without LTO.