This is an archive of the discontinued LLVM Phabricator instance.

WIP: Reenable creation of artificial methods in AddMethodToCXXRecordType(...)
Needs ReviewPublic

Authored by shafik on May 12 2020, 2:53 PM.

Details

Summary

The framework for dealing with artificial methods is in place in AddMethodToCXXRecordType(...) but was gated by an early exit at some point.

This was due to implicit methods being emitted in some translations units but not others and then sometimes during import via the ASTImporter this would fail since they would no longer be structurally equivalent.

We have not been able to come up with a reproducer for the original issue and after fixing D79251 which was accidentally working the test suite passes with this change and we have cases that are broken without this change.

Diff Detail

Event Timeline

shafik created this revision.May 12 2020, 2:53 PM
shafik planned changes to this revision.May 12 2020, 2:54 PM
shafik requested review of this revision.May 12 2020, 2:54 PM
shafik retitled this revision from Reenable creation of artificial methods in AddMethodToCXXRecordType(...) to WIP: Reenable creation of artificial methods in AddMethodToCXXRecordType(...).

I labeled this as WIP in progress since I expect some discussion and maybe a test case for which the change fails.

In general, I think this is a good direction to go in. We can run into the same kinds of problems even with non-artificial functions -- either the application can have real ODR violations, or we can introduce ODR violations during expression eval by playing fast-and-loose with anonymous namespaces.

For all of those cases we need a more general approach to handling the "I have two versions of a class, and they don't really match" problem.

As for a test case where this would fail, I haven't had first hand experience with this, but I would guess it involves two shared libraries, one which contains the definition of this artificial constructor, and one which doesn't. And then pulling both of these class definitions into the same expression...

lldb/test/API/lang/cpp/constructors/main.cpp
11

Just for my own education, how would this fail without the patch? During expression evaluation, we would synthesize a call to the default constructor of ClassForSideEffect, instead of the one taking int=100? Or is it something else?

shafik marked an inline comment as done.May 13 2020, 10:17 AM

In general, I think this is a good direction to go in. We can run into the same kinds of problems even with non-artificial functions -- either the application can have real ODR violations, or we can introduce ODR violations during expression eval by playing fast-and-loose with anonymous namespaces.

For all of those cases we need a more general approach to handling the "I have two versions of a class, and they don't really match" problem.

As for a test case where this would fail, I haven't had first hand experience with this, but I would guess it involves two shared libraries, one which contains the definition of this artificial constructor, and one which doesn't. And then pulling both of these class definitions into the same expression...

Yes, in some offline conversations with @clayborg that was the scenario he had seen this come up before. I was not able to synthesize a test case that ran into this issue but that does not mean it is not possible. Perhaps I just wasn't creative enough ;-)

lldb/test/API/lang/cpp/constructors/main.cpp
11

So it would not initialize x to 10 and it would not call ClassForSideEffect(100) it would just default initialize it. I don't quite understand why, I plan to dig into that some more but I wanted to get the conversation rolling.

The error where two of the same classes conflicted usually only happened in complex cases that were hard to reduce down.

If I understand this correctly, the compiler should be able to auto synthesize these functions at will since the original class definition should have all of the information it needs to create them. So why do we need these to be added? Seems like the wrong approach IMHO. I would like the hear the justification for needing to add artificial functions to a class definition before we entertain this more. Can you elaborate on why you think these are needed?

The error where two of the same classes conflicted usually only happened in complex cases that were hard to reduce down.

If I understand this correctly, the compiler should be able to auto synthesize these functions at will since the original class definition should have all of the information it needs to create them. So why do we need these to be added? Seems like the wrong approach IMHO. I would like the hear the justification for needing to add artificial functions to a class definition before we entertain this more. Can you elaborate on why you think these are needed?

Yes, the test case below shows that if we don't add the method then during expression evaluation of let's say:

A{}

The default member initializer won't be done. So x won't be initialized to 10 and the default constructor for cse will be called instead of ClassForSideEffect(int).

The error where two of the same classes conflicted usually only happened in complex cases that were hard to reduce down.

If I understand this correctly, the compiler should be able to auto synthesize these functions at will since the original class definition should have all of the information it needs to create them. So why do we need these to be added? Seems like the wrong approach IMHO. I would like the hear the justification for needing to add artificial functions to a class definition before we entertain this more. Can you elaborate on why you think these are needed?

Yes, the test case below shows that if we don't add the method then during expression evaluation of let's say:

A{}

The default member initializer won't be done. So x won't be initialized to 10 and the default constructor for cse will be called instead of ClassForSideEffect(int).

So that sounds like a compiler bug or something that we are not setting up right in the class type when we convert it from DWARF. Can you compile a quick example and dump the AST somehow and then compare it to the class we generate from debug info and see how they differ?

The error where two of the same classes conflicted usually only happened in complex cases that were hard to reduce down.

If I understand this correctly, the compiler should be able to auto synthesize these functions at will since the original class definition should have all of the information it needs to create them. So why do we need these to be added? Seems like the wrong approach IMHO. I would like the hear the justification for needing to add artificial functions to a class definition before we entertain this more. Can you elaborate on why you think these are needed?

Yes, the test case below shows that if we don't add the method then during expression evaluation of let's say:

A{}

The default member initializer won't be done. So x won't be initialized to 10 and the default constructor for cse will be called instead of ClassForSideEffect(int).

So that sounds like a compiler bug or something that we are not setting up right in the class type when we convert it from DWARF. Can you compile a quick example and dump the AST somehow and then compare it to the class we generate from debug info and see how they differ?

The DWARF will not contain the description of the =10 part of int x=10; in the class definition. Without it, we don't have a way to know what the default constructor will do. It's true that for this variable, the compiler could emit DW_AT_default_value(10) to describe that, but the approach does not seem feasible in general -- the RHS can be an arbitrarily complex expression, referencing other functions, variables, etc.

It seems to me you're assigning more power to DW_AT_artificial than what it actually does. All that attribute says is that the entity was generated by the compiler ("Objects or types that are not actually declared in the source" in spec's own words). It does not mean that its semantics are fully described by the rest of the debug info. Since the compiler was able to generate that function, its semantics must have been fully described by the _source_ code, but dwarf also does not claim to describe the entirety of the source code.

Before c++11 came along with default initializers, it was mostly correct to say that "artificial" default constructors could be regenerated from dwarf at will -- because all that constructor could do is default-initialize all sub-objects. However, that is no longer the case...

Here's another interesting use of aritificial functions: inherited constructors.

struct A { A(int); };
struct B:A { using A::A; };
B b(2);

The constructor B::B(int) will be marked artificial, but it is also not reconstructible from the debug info because it use explicit default initializers for all members in B (if they existed -- this example does not have any).

This example also demonstrates what I believe *is* a bug in the compiler. The inherited constructor will get DW_AT_name(A):

0x0000003f:   DW_TAG_structure_type
                DW_AT_calling_convention	(DW_CC_pass_by_value)
                DW_AT_name	("B")
                DW_AT_byte_size	(0x01)
                DW_AT_decl_file	("/home/pavelo/ll/build/opt/<stdin>")
                DW_AT_decl_line	(1)

0x00000048:     DW_TAG_inheritance
                  DW_AT_type	(0x0000005f "A")
                  DW_AT_data_member_location	(0x00)

0x0000004e:     DW_TAG_subprogram
                  DW_AT_name	("A")
                  DW_AT_declaration	(true)
                  DW_AT_artificial	(true)
                  DW_AT_external	(true)

That doesn't sound correct to me. Gcc emits the name as B, which seems to be much better. Looping in @dblaikie for thoughts.

I really worry enabling this will make the expression parser start to emit errors if we change this. The best fix would be to ensure that the AST importer can correctly ignore implicit functions when comparing types.

Here's another interesting use of aritificial functions: inherited constructors.

struct A { A(int); };
struct B:A { using A::A; };
B b(2);

The constructor B::B(int) will be marked artificial, but it is also not reconstructible from the debug info because it use explicit default initializers for all members in B (if they existed -- this example does not have any).

This example also demonstrates what I believe *is* a bug in the compiler. The inherited constructor will get DW_AT_name(A):

0x0000003f:   DW_TAG_structure_type
                DW_AT_calling_convention	(DW_CC_pass_by_value)
                DW_AT_name	("B")
                DW_AT_byte_size	(0x01)
                DW_AT_decl_file	("/home/pavelo/ll/build/opt/<stdin>")
                DW_AT_decl_line	(1)

0x00000048:     DW_TAG_inheritance
                  DW_AT_type	(0x0000005f "A")
                  DW_AT_data_member_location	(0x00)

0x0000004e:     DW_TAG_subprogram
                  DW_AT_name	("A")
                  DW_AT_declaration	(true)
                  DW_AT_artificial	(true)
                  DW_AT_external	(true)

That doesn't sound correct to me. Gcc emits the name as B, which seems to be much better. Looping in @dblaikie for thoughts.

Agreed - sounds like a bug/something to fix in Clang there - if you'd like to fix it/file a bug/etc.

I just realized a different problem with this approach. Since the constructor DIE is only emitted when it is actually used, we could run into problems/inconsistencies even within a single library. We currently parse only one instance of a class description -- assuming all of them to be equivalent. This means that if we happen to pick the class description without the constructor as *the* definition for the class, then the we would still fall back to the (incorrect) constructor that clang generates for us. In this test case that could be demonstrated by having another compile unit, which only uses ClassWithImplicitCtor, but never constructs it, and then arranging it such that this file is picked for the class definition (probably involves putting this file first on the link line, or ensuring the first breakpoint hit is in that file).

Fixing that is going to be tricky. The best I can come up with right now, is injecting a constructor declaration into a class even if DWARF does not describe one. Then, if the user calls it, we go searching for the constructor symbol in the module. If we find one, we use it. If we don't, we bail out of the expression.

I think that approach would result in the most correct behavior, and it would also solve the structural equivalence issues that Greg is worried about (all classes would have the default constructor). However, it does have some serious drawbacks: The presence of a user-specified constructor (which is how this will appear to clang) disables certain ways of initializing the class. So for instance, the user would not be able to do this,

code:
struct Simple { int a, b, c; };
lldb:
p Simple() # value-initialized to zero
p Simple{47} # compound initialization, a=47, b=c=0
p Simple{1, 2, 3} # compount initalization with all members

That is kind of correct, since we're actually not able to differentiate the above class, from a class like this struct SeeminglySimple { int a = random(), b = random(), c = random(); };. However, I have a feeling it would annoy a lot of people. The best we could do is make the third expression work by synthesizing an (int, int, int) constructor initializing all members, as that will not use the default initializers (if they existed).

I have also considered the idea of trawling through all definitions of the class, in search of the default constructor. However:

  • it's possible that no compile unit uses it, but the class still contained default initializers. Then, we will still get the initialization wrong.
  • it won't help with partial compound initialization (second expression)
  • it will slow us down (but maybe not so much with indexes, as we know the constructor name)

So, it does seem like some help from the compiler is called for. Maybe the compiler could attach a DW_AT_default_value with an empty expression to the members which have default initializers, as a way of saying "this member has a default initializer, but I can't/won't tell you what it is" ? Then we could search for this attribute and refuse constructing classes which have it. That would enable us to keep p Simple{} working, while preventing p SeeminglySimple{}.

This example also demonstrates what I believe *is* a bug in the compiler. The inherited constructor will get DW_AT_name(A):

0x0000003f:   DW_TAG_structure_type
                DW_AT_calling_convention	(DW_CC_pass_by_value)
                DW_AT_name	("B")
                DW_AT_byte_size	(0x01)
                DW_AT_decl_file	("/home/pavelo/ll/build/opt/<stdin>")
                DW_AT_decl_line	(1)

0x00000048:     DW_TAG_inheritance
                  DW_AT_type	(0x0000005f "A")
                  DW_AT_data_member_location	(0x00)

0x0000004e:     DW_TAG_subprogram
                  DW_AT_name	("A")
                  DW_AT_declaration	(true)
                  DW_AT_artificial	(true)
                  DW_AT_external	(true)

That doesn't sound correct to me. Gcc emits the name as B, which seems to be much better. Looping in @dblaikie for thoughts.

Agreed - sounds like a bug/something to fix in Clang there - if you'd like to fix it/file a bug/etc.

Yeah, I guess should do that, before we start needing to work around this bug for compatibility purposes. :)

FWIW I agree that based on the current DWARF LLVM and GCC produce there's no way to allow users to default construct structs in C++11 (where non-static data member initializers were added) or above in the absence of a user-provided constructor - I don't think it'd be unreasonable to attach a null/empty default_value for "has an initializer but I won't tell you what it is" - chances are plumbing this through Clang/LLVM metadata, etc, might go most of the way to plumbing through at least simple default initializer values too - so you might get the simple constant non-empty values for free here anyway, which would be nice/allow some default/implicit member initialization.

But if you're preparing to go down that route, a thread on llvm-dev with all the usual debug info cohort would be worthwhile to check this.

aganea added a subscriber: aganea.May 28 2020, 7:46 PM