This is an archive of the discontinued LLVM Phabricator instance.

Patch for LLDB demangler for demangling upon actual language
AbandonedPublic

Authored by ki.stfu on Jan 30 2015, 11:15 AM.

Details

Summary

This patch includes following changes:

  • Add Language member to Mangled class with corresponding constructor. Minor refactoring in places where Mangled instances used.
  • Function name demangling depending on its compile unit actual language.
  • Enable demangling for case if function/sym language still unknown at moment when demangling requested.
  • Flag indicating known language was set added. Demangler now is able to re-demangle given sym name if it language was re-set since Mangled object creation. Demangling against unknown language is allowed because in lldb unknown language means C++/ObjC in most cases implicitly.

Diff Detail

Event Timeline

ki.stfu updated this revision to Diff 19049.Jan 30 2015, 11:15 AM
ki.stfu retitled this revision from to Patch for LLDB demangler for demangling upon actual language.
ki.stfu updated this object.
ki.stfu edited the test plan for this revision. (Show Details)
ki.stfu added reviewers: clayborg, zturner, jingham, k8stone.
ki.stfu set the repository for this revision to rL LLVM.
ki.stfu added subscribers: Unknown Object (MLST), clayborg, zturner and 2 others.
zturner added inline comments.Jan 30 2015, 11:28 AM
include/lldb/Core/Mangled.h
94

I don't like having this second argument here. If you know in advance the name is demangled, why even use this class for anything?

347

Just call it m_language_was_set. We don't use hungarian notation.

source/Core/Mangled.cpp
5095

I think the default constructor needs to be updated to initialialize m_language and m_b_language_was_set.

5226

Instead of having this check, can you just clear m_demangled when you call SetLanguage()? In fact, we could probably get rid of m_bLanguage_was_set entirely if you do this.

5271–5273

can the two separators be const char * instead of an std::string (to avoid a heap allocation), and also moved inside of the case statement for Pascal / Java? None of the other branches use them. Also temp_demangled_name can be moved into the case as well, and the default case statement can just use demangled_name.c_str()

5301

This line also goes away if we delete this variable and call m_demangled.Clear() in SetLanguage()

5364

Should this only happen if the new language is not equal to the old language? (Also reiterating the comment about removing this variable and just clearing m_demangled)

zturner added inline comments.Jan 30 2015, 12:50 PM
include/lldb/Core/Mangled.h
94

Looks like this argument is here before. Removing it seems like a bigger task that's outside the scope of this CL. So ignore this comment for now.

clayborg requested changes to this revision.Jan 30 2015, 2:59 PM
clayborg edited edge metadata.

I really can't take the memory hit that is incurred by adding anything to the Mangled class. lldb_private::Symbols are one of the largest sources of memory consumption by a large debug session in LLDB and adding and extra 64 bits (32 for the enum and 1 byte for the bool and then padding) to each one it too much. Can we make this change in a way that doesn't require the extra storage?

Can't we tell what is mangled by looking at the prefix in a Mangled instance? If not, you will need to make a MangleWithLanguage instance for places the language needs to be used if it can't be deduced by the mangling prefix. The "m_language" is not really used inside the the Mangled class, so it isn't required for anything inside of the Mangled class.

You should also not have to modify the DWARF parser at all, you can just get the lldb_private::CompileUnit from the DWARF compile unit and ask the lldb_private::CompileUnit for its language (lldb_private::CompileUnit::GetLanguage()).

This revision now requires changes to proceed.Jan 30 2015, 2:59 PM

To make my comments clearer:

1 - Can't we tell the language by looking at the Mangled prefix? "_Z" = C++, "$" = C++?
2 - Where do we need to know the language? I the JIT? If so we might need a new MangledWithLanguage class and use it only where we need it so it doesn't make lldb_private::Symbol and lldb_private::Function and lldb_private::Block's inlined info take up more space. We spent a lot of time doing memory analysis on LLDB and lldb_private::Symbol objects were consistently at the top of the heap.

zturner edited edge metadata.Jan 30 2015, 3:17 PM

Greg: How important is it from a performance perspective that Symbol be able to cache the demangled name and not recompute it every time you call GetDemangledName?

Not very important actually because our constant string pool (lldb_private::ConstString) has a:

bool
ConstString::GetMangledCounterpart (ConstString &counterpart) const;

So we could just store the name and note if is is mangled or not and we would still have an accessor on symbol that we could change from:

Mangled&
GetMangled ()
{
    return m_mangled;
}

const Mangled&
GetMangled () const
{
    return m_mangled;
}

to be:

Mangled
GetMangled ()
{

Mangled mangled;

/// Calculate and return a Mangled instance...

...
return mangled;

}

But again, this would help get lldb_private::Symbol to take less memory than it already does, but I still wouldn't want to lose those gains in savings by adding language with every mangled name. It doesn't really need to be there. And where it does and where it is important, we could add a language. It seems like a heavy hammer for a small controlled use case.

Greg

I agree that if memory is important then we should use the opportunity to reduce memory usage rather than keeping it the same by changing stuff. But the reason I asked leads into my next question.

I've been thinking about mangling and demangling for a while and how it relates to Windows. I see a lot of code all over the place that manually inspects mangled names, and usually the code is all custom and handrolled. (If you're interested I can point you to a bunch of examples). I don't like this way of doing things and I think it's generally fragile. There should be one place that's responsible for anything to do with mangling. All these places that are inspecting strings for _Z or ? should just be calling some class to ask it about the properties of this string.

The most sensible place to do that, to me, seems like the ABI. So I'm imagining that there's a Mangler base class, and then from that there is an ItaniumCppMangler, a MsCppMangler, and let's say perhaps a JavaMangler for the purposes of this CL. Maybe they share some code, but that's not the important part.

ABI provides a method called getMangler(). It returns a singleton instance (which for Windows would be an MsCppMangler, and for everyone else would be an ItaiumCppMangler).

In the Symbol class, then, all you need to store is the mangled name. Implement a method in Symbol called getMangler() which looks at m_comp_unit, and either gets the ABI and calls getCppMangler (if Lang is C++) or a null mangler (if Lang is C) or a java mangler (if Lang is Java), etc.

Then, just call the method on it.

All this seems complicated, but the advantage is that now this logic is abstracted for anyone else who wants to use it. The Mangler interface could provide such methods as IsGuardVariable() or IsFunction() that things like the interpreter could use by getting the correct mangler from the ABI, for example. And all of the places in the code that currently have hardcoded mangler checks could be made to work in the presence of ABI differences and language differences easily.

And this doesn't impose any memory penalty on Symbol (and actually reduces the footprint of each Symbol by the size of 1 pointer)

I do also want to stress that I believe that all the places that are manually checking for "_Z" and the likes switch over to using methods on Mangled so we can avoid the issues you are seeing.

Changes requested in this review were taken into account in reworked patch - see http://reviews.llvm.org/D7409.

To make my comments clearer:

1 - Can't we tell the language by looking at the Mangled prefix? "_Z" = C++, "$" = C++?
2 - Where do we need to know the language? I the JIT? If so we might need a new MangledWithLanguage class and use it only where we need it so it doesn't make lldb_private::Symbol and lldb_private::Function and lldb_private::Block's inlined info take up more space. We spent a lot of time doing memory analysis on LLDB and lldb_private::Symbol objects were consistently at the top of the heap.

1 - No - because, for example, Delphi use same mangling, but has thin differences in types name demangling and namespace separator (same with Java).

2 - We need to know exact language for given function or method name demangling with its parameters types on full header.

I really can't take the memory hit that is incurred by adding anything to the Mangled class. lldb_private::Symbols are one of the largest sources of memory consumption by a large debug session in LLDB and adding and extra 64 bits (32 for the enum and 1 byte for the bool and then padding) to each one it too much. Can we make this change in a way that doesn't require the extra storage?

Can't we tell what is mangled by looking at the prefix in a Mangled instance? If not, you will need to make a MangleWithLanguage instance for places the language needs to be used if it can't be deduced by the mangling prefix. The "m_language" is not really used inside the the Mangled class, so it isn't required for anything inside of the Mangled class.

You should also not have to modify the DWARF parser at all, you can just get the lldb_private::CompileUnit from the DWARF compile unit and ask the lldb_private::CompileUnit for its language (lldb_private::CompileUnit::GetLanguage()).

Greg, Mangled class is low level helper to Symbol, Function and other code which use demangling. If DWARF allocated Mangled instance - language should be passed from DWARF attrs here - because by itself Mangled can't have reverse references to classes which can use Mangled class.

include/lldb/Core/Mangled.h
347

Fixed - see reworked patch at http://reviews.llvm.org/D7409.

source/Core/Mangled.cpp
5095

Fixed - see reworked patch at http://reviews.llvm.org/D7409.

5226

No, because my idea was - initially Symbol or Function name demangled against unknown (i.e. = C++ implicitly) language in places where language cannot be actualized, then language may be actualized (from CU language or if user set language explicitly for frame) and Mangled should do re-demangling upon actual language given via constructor or via SetLanguage() method. If set language still unknown - re-demangling is not performed. If you will clear m_demangled unconditionally on each SetLanguage() invocation you will be obliged to re-demangle name again and again - i.e. performance impact. So flag needed in addition to m_demangled != NULL check.

5271–5273

Fixed - see reworked patch at http://reviews.llvm.org/D7409.

5301

Please see my answer about why flag needed instead just m_demangled != NULL check above.

5364

Yes - delayed re-demangling starts only in case language changed to known language instead of unknown. But initially (first time) demangling always performed blindly against unknown language in C++ manner - because lots of language dependent lldb code written in implicit assumption what language is C++/ObjC, so I created my patch with respect to this.

ki.stfu abandoned this revision.Feb 5 2015, 8:04 AM

See additional information at http://reviews.llvm.org/D7409

dawn added a subscriber: dawn.Jul 1 2015, 12:18 PM

See http://reviews.llvm.org/D10744 for a rewrite of this patch.