This is an archive of the discontinued LLVM Phabricator instance.

Fix demangling of names if required by language
AbandonedPublic

Authored by dawn on Jun 25 2015, 12:43 PM.

Details

Summary

lldb assumes mangled names should be demangled using a C++ syntax. This patch works around that by allowing the language to be specified when demangling, and if necessary, forcing the demangled name to be recalculated if/when the language is known.

Diff Detail

Repository
rL LLVM

Event Timeline

dawn updated this revision to Diff 28493.Jun 25 2015, 12:43 PM
dawn retitled this revision from to Fix demangling of names if required by language.
dawn updated this object.
dawn edited the test plan for this revision. (Show Details)
dawn added reviewers: clayborg, chaoren, zturner.
dawn set the repository for this revision to rL LLVM.
dawn added a reviewer: KLapshin.
dawn added a subscriber: Unknown Object (MLST).
dawn added a comment.Jun 30 2015, 8:09 AM

Please review? Thank you.

source/Core/Mangled.cpp
351

Not sure why long lines appear indented wrong in the diff since no tabs were used and it looks fine in the source. Hmm...

clayborg requested changes to this revision.Jul 1 2015, 11:15 AM
clayborg edited edge metadata.

I would prefer to see this patch done differently. There are new methods make on Mangled and on ConstString that probably don't need to be there. Can we have the mangled names be correct from the start by specifying the language when we create the Mangled string in the first place? We don't want to store a language with the Mangled object because that would make lldb_private::Symbol take up even more space since it has a Mangled object inside of it and lldb_private::Symbol objects are the biggest memory hogs in LLDB right now (we did memory profiling a while back). So I would prefer to see a solution that allows clients to specify the language when setting the mangled name. If this mangled name is eLanguageTypePascal83 or eLanguageTypeJava, then demangle right away and fix up the mangled string before it gets entered as the demangled counterpart.

include/lldb/Core/ConstString.h
428–434

If my solution outlined in the request changes will work, the will go away.

include/lldb/Core/Mangled.h
185

If my solution outlined in the request changes will work, the will go away.

319–340

If my solution outlined in the request changes will work, the will go away.

source/Core/Mangled.cpp
263–359

If my solution outlined in the request changes will work, the will go away. This code will become part of the code that sets the mangled name in a new version of Mangled::SetValue(...) that takes a language parameter:

void
SetValue (const ConstString &name, bool is_mangled, LanguageType language);
474–512

If my solution outlined in the request changes will work, the will go away.

source/Plugins/SymbolFile/DWARF/DWARFCompileUnit.cpp
883

You are accidentally taking the current DIE and thinking it is the compile unit DIE? This is wrong. You should calculate it once at the top using:

const LanguageType cu_language = GetLanguageType();

Then use that everywhere. So remove this line of code.

906

Remove this and use the one that is calculated once at the top.

954

Remove this and use the one that is calculated once at the top.

source/Symbol/Function.cpp
219

If my solution outlined in the request changes will work, the will go away.

244

If my solution outlined in the request changes will work, the will go away.

This revision now requires changes to proceed.Jul 1 2015, 11:15 AM
dawn added a comment.Jul 1 2015, 12:16 PM

(fyi - added additional reviewers from related patch at http://reviews.llvm.org/D7302)

dawn added a comment.Jul 1 2015, 12:32 PM

I would prefer to see this patch done differently. There are new methods make on Mangled and on ConstString that probably don't need to be there.

New methods on ConstString are required in order to undo/redo the demangling once the language is known.

Can we have the mangled names be correct from the start by specifying the language when we create the Mangled string in the first place?

No, because the first time the symbol is demangled, we find the linker/minimal symbol and the language is unknown since there is no debug info yet. Then we parse the debug info and find the language in the CU, but the demangling has already taken place so we must undo what we did in the correct language.

We don't want to store a language with the Mangled object because that would make lldb_private::Symbol take up even more space since it has a Mangled object inside of it and lldb_private::Symbol objects are the biggest memory hogs in LLDB right now (we did memory profiling a while back).

I know. Hence this patch avoids the extra members.

So I would prefer to see a solution that allows clients to specify the language when setting the mangled name. If this mangled name is eLanguageTypePascal83 or eLanguageTypeJava, then demangle right away and fix up the mangled string before it gets entered as the demangled counterpart.

That is simply not possible. The language is not known until the debug info is read. For an exe that has mixed languages (e.g. C++ and Pascal) using Itanium mangling, only the debug info can tell us what the correct language is, and we don't get that information until the DWARF is parsed.

dawn added a comment.Jul 6 2015, 9:34 AM

Do you want me to submit a new patch with the minor changes related to "Remove this and use the one that is calculated once at the top" in Mangled.cpp? I can't fix the other things you suggested as they won't work as explained above.

source/Plugins/SymbolFile/DWARF/DWARFCompileUnit.cpp
883

Thanks for this. I didn't think GetLanguageType would work for some reason but it's been a while since I worked on this patch - need to have another look.

906

Thanks - will do.

954

Thanks - will do.

dawn added a comment.Jul 7 2015, 11:59 AM

Greg, please respond. Again, the solution you proposed will not work, because the language is unknown until the DWARF is read. This patch has minimal impact on implementations for which this is not an issue, yet is critical, as it allows us to support Delphi and C++ in the same binary. Please accept this patch after the issues you raised with in DWARFCompileUnit.cpp are addressed.

Is there anyway to appeal this? Can other folks please have a look? Someone with a better understanding of how the demangling is architected should be able to see why this patch is required.

I understand demangling just fine, I just don't like the way this patch adds functions that shouldn't need to exist. No one should be removing a mangling and replacing with another by patching up function names after they have been demangled.

I don't know how you can claim we don't know the language in the DWARF since we always know the compile unit language for _any_ DIE in the DWARF.

We won't know the language for symbols in object file symbol tables, but we always do for DWARF.

The other problem is we might have two symbols, one from Pascal and one from C++ that could have the same mangled name for different private non-exported symbols in two different shared libraries like "foo::bar" in C++ and "foo.bar" in pascal. Are there really no differences in the mangling for two such functions? Your current patch will just change the mangling for the C++ one over to use the pascal demangling and that will introduce a bug. It will only introduce the bug if the mangled name from pascal has ever been accessed, otherwise it won't change it over. This needs to be fixed in another way.

dawn added a comment.EditedJul 7 2015, 6:07 PM

Thank you very much for your reply.

Consider the following.
Suppose we have the following source written in Delphi that we want to debug (translated to C++ here for the purposes of explaination):

#include <stdio.h>
namespace ns {
    int func(int i) {
        return ++i;
    }
}
int main() {
    int y;
    y = ns::func(x);
    printf("y=%d\n", y);
    return 0;
}

In lldb, set a breakpoint at program load (before main) via 'b func'. The resulting code flow in lldb is as follows:

//# lldb first tries to get the symbol as a linker symbol; the lldb calls look something like:
BreakpointResolverName::SearchCallback
    ->Module::FindFunctions
    ->SymbolVendor::GetSymtab
    ->ObjectFileMachO::ParseSymtab
        # Reading thru linker symbols, cu-lang unknown.
    ->Mangled::GetDemangledName
        # This stores the demangled counterpart of '__ZN2ns4funcEi' in
        # the string pool as 'ns::func' because the language is unknown
        # and defaults to C++ syntax due to the Itanium mangling scheme.
# The match fails.
# Lldb continues the search and tries to match using the unqualified
# name, at which time the DWARF for the CU is read; the lldb calls look
# something like:
BreakpointResolverName::SearchCallback
    ->Module::FindFunctions
        # name_type_mask=56
    ->SymbolFileDWARF::ResolveFunction
    ->SymbolFileDWARF::ParseCompileUnit
        # DWARF for CU is read and language is now known.
# After reading the DWARF, the match is checked against the demangled
# name; the lldb calls look something like:
BreakpointResolverName::SearchCallback
    ->Module::FindFunctions
    ->Symtab::FindAllSymbolsWithNameAndType
    ->Mangled::GetDemangledName
        # This returns the previously saved counterpart from the string
        # pool 'ns::func' which fails to match since '::' is not
        # recognized as a namespace separator in Delphi.

This patch forces the demangling to be fixed up to 'ns.func' once the language for the CU is known (when the CU's DWARF is parsed), and the match in FindAllSymbolsWithNameAndType can then succeed. Since lldb demangles the name before the language is known, there is simply no way around having to replace the demangled name in the string pool.

BTW, I have implemented your suggested changes to the DWARF reader - thank you for those.

source/Plugins/SymbolFile/DWARF/DWARFCompileUnit.cpp
883

This change was made, tested, and will be in the final patch - thank you.

906

This change was made, tested, and will be in the final patch - thank you.

954

This change was made, tested, and will be in the final patch - thank you.

dawn added a comment.EditedJul 7 2015, 6:41 PM

I understand demangling just fine, I just don't like the way this patch adds functions that shouldn't need to exist. No one should be removing a mangling and replacing with another by patching up function names after they have been demangled.

I wish there were another way, alas lldb's design forces this.

I don't know how you can claim we don't know the language in the DWARF since we always know the compile unit language for _any_ DIE in the DWARF.

I didn't mean to claim that. Yes, we know the language in DWARF; it's when the symbol is first demangled (before DWARF is parsed) that we don't know the language.

We won't know the language for symbols in object file symbol tables, but we always do for DWARF.

Yup. Agreed.

The other problem is we might have two symbols, one from Pascal and one from C++ that could have the same mangled name for different private non-exported symbols in two different shared libraries like "foo::bar" in C++ and "foo.bar" in pascal.

Those should result in duplicate definition errors in the linker.

Are there really no differences in the mangling for two such functions?

Yes. Both 'ns::func' in C++ and 'ns.func' in Pascal mangle to '__ZN2ns4funcEi' in the example above. This makes it possible to call into Pascal from C++ and vice versa.

Your current patch will just change the mangling for the C++ one over to use the pascal demangling and that will introduce a bug. It will only introduce the bug if the mangled name from pascal has ever been accessed, otherwise it won't change it over. This needs to be fixed in another way.

Yes, it will demangle to C++ by default unless/until the CU is parsed, but there is no getting around that. Hence a patch may be forthcoming which allows a user to specify a default language to use for when the DWARF has not yet been read and/or doesn't exist, and '__ZN2ns4funcEi' will demangle to 'ns.func' in that case if the option is set to Pascal. I may also add support for an override language setting which will demangle the symbol in the requested language regardless of the DWARF.

Ideally lldb would have been designed more like gdb, where the demangling is done on the fly and '__ZN2ns4funcEi' would appear as 'ns.func' in a C++ context and 'ns.func' in a Pascal context, but the current design of lldb won't allow for that without major overhaul.

Greg: The other problem is we might have two symbols, one from Pascal and one from C++ that could have the same mangled name for different private non-exported symbols in two different shared libraries like "foo::bar" in C++ and "foo.bar" in pascal.

Dawn: Those should result in duplicate definition errors in the linker.

No they wouldn't if they were private symbols in two different shared libraries. So this can happen.

Dawn: Ideally lldb would have been designed more like gdb, where the demangling is done on the fly and '__ZN2ns4funcEi' would appear as 'ns.func' in a C++ context and 'ns.func' in a Pascal context, but the current design of lldb won't allow for that without major overhaul.

So demangling is very expensive and caching the results is a good design decision.

I agree we need to fix this in LLDB, but demangling a name incorrectly and playing with that for a while and then changing it later ins't a good fix either. So lets find a way to fix this, and do it right.

So the major problems we still have that we haven't solved:
1 - when we just have a symbol table, we have NO way of knowing how to demangle a name and we might show it incorrectly in stack backtraces if we don't have debug info. I have no solution for this unless object files contain some sort of tracking on what functions are pascal and I doubt we can do this.
2 - If we do have debug info we might need more than one demangling for a given mangled name, right now we only store one. We currently do this by storing the C++ demangled name in the StringPool entry. StringPool entries in the constant string pools have two "const char *" entries and it currently contains the demangled/mangled names for C++ or the mangled/demangled name for C++. This saves us a ton of time as we never have to demangle a name twice and demangling is very very expensive so this design will stay here. Why? Because the way GCC and clang make debug info, they duplicate std::string, and all STL types in each .o file so that means we end up with tons of duplicated mangled names in many many many .o files. We need to find a way to fit it demangling in for Pascal and Java.

So I propose that it is ok to have the Mangled::GetDemangledName() take a language:

const ConstString&
Mangled::GetDemangledName (lldb::LanguageType language) const

Then if the language isn't pascal or java, do what we currently do, no changes. If not, we create a side table for each language that demangles differently and store the results there using the code that you used to change the mangled name over. Then we just need to find the calls to GetDemangledName() and supply a language where we need to. This shouldn't be too hard to do.

dawn added a comment.Jul 8 2015, 11:34 AM

Greg: The other problem is we might have two symbols, one from Pascal and one from C++ that could have the same mangled name for different private non-exported symbols in two different shared libraries like "foo::bar" in C++ and "foo.bar" in pascal.

Dawn: Those should result in duplicate definition errors in the linker.

No they wouldn't if they were private symbols in two different shared libraries. So this can happen.

Yes. And in that case the last module's CU which was read wins. I'm willing to accept this edge case until lldb can be redesigned to handle it.

Dawn: Ideally lldb would have been designed more like gdb, where the demangling is done on the fly and '__ZN2ns4funcEi' would appear as 'ns.func' in a C++ context and 'ns.func' in a Pascal context, but the current design of lldb won't allow for that without major overhaul.

So demangling is very expensive and caching the results is a good design decision.

I understand that that was the intent.

So the major problems we still have that we haven't solved:
1 - when we just have a symbol table, we have NO way of knowing how to demangle a name and we might show it incorrectly in stack backtraces if we don't have debug info. I have no solution for this unless object files contain some sort of tracking on what functions are pascal and I doubt we can do this.

Right. For this, I propose adding the default language option so that the user can see these names demangled in the language of their choosing. FYI, GDB has an override option, and if not set, it uses the language of the selected frame's CU, but getting the frame info to where the demangling/matching happens in lldb is a nightmare.

So I propose that it is ok to have the Mangled::GetDemangledName() take a language:

const ConstString&
Mangled::GetDemangledName (lldb::LanguageType language) const

Then if the language isn't pascal or java, do what we currently do, no changes. If not, we create a side table for each language that demangles differently and store the results there using the code that you used to change the mangled name over. Then we just need to find the calls to GetDemangledName() and supply a language where we need to. This shouldn't be too hard to do.

I tried this approach first. It was nearly impossible to pass the correct language down to GetDemangledName, as there were many layers of calls to get to the CU and/or frame. It would have required changes to many classes and functions in order to pass the language to the demangled call. I abandoned that approach after several attempts. You would have rejected it due to the extra class members alone.

This patch solves many of the problems with minimal impact to other implementations and to the design of lldb. I can put a giant FIXME/TODO comment in with your suggestions and an acknowledgement that this isn't perfect but it's the best we can do for now. Please accept this patch as an interim solution?

I don't want this patch in top of tree. Feel free to make a branch where you can maintain this fix, or fix it correctly and check it in to top of tree.

I will commit a fix that implements this correctly and let you finish it up once it is in.

dawn added a comment.Jul 8 2015, 2:33 PM

I will commit a fix that implements this correctly and let you finish it up once it is in.

Thanks! Do I abandon this review in the meantime?

Please abandon this change and modify the following function after updating to my partial fix that plumbs all the code to be able to supply the language:

% svn commit
Sending        include/lldb/Core/Mangled.h
Sending        include/lldb/Symbol/Function.h
Sending        include/lldb/Symbol/Symbol.h
Sending        include/lldb/Symbol/Variable.h
Sending        source/API/SBBlock.cpp
Sending        source/API/SBFrame.cpp
Sending        source/API/SBFunction.cpp
Sending        source/API/SBSymbol.cpp
Sending        source/Breakpoint/BreakpointLocation.cpp
Sending        source/Core/FormatEntity.cpp
Sending        source/Core/Mangled.cpp
Sending        source/Expression/IRExecutionUnit.cpp
Sending        source/Expression/IRForTarget.cpp
Sending        source/Plugins/DynamicLoader/Hexagon-DYLD/DynamicLoaderHexagonDYLD.cpp
Sending        source/Plugins/DynamicLoader/MacOSX-DYLD/DynamicLoaderMacOSXDYLD.cpp
Sending        source/Plugins/DynamicLoader/POSIX-DYLD/DynamicLoaderPOSIXDYLD.cpp
Sending        source/Plugins/LanguageRuntime/CPlusPlus/ItaniumABI/ItaniumABILanguageRuntime.cpp
Sending        source/Plugins/ObjectFile/ELF/ObjectFileELF.cpp
Sending        source/Plugins/ObjectFile/Mach-O/ObjectFileMachO.cpp
Sending        source/Plugins/SymbolFile/DWARF/DWARFCompileUnit.cpp
Sending        source/Plugins/SymbolFile/DWARF/SymbolFileDWARF.cpp
Sending        source/Plugins/SymbolFile/DWARF/SymbolFileDWARFDebugMap.cpp
Sending        source/Plugins/SymbolFile/Symtab/SymbolFileSymtab.cpp
Sending        source/Symbol/Function.cpp
Sending        source/Symbol/Symbol.cpp
Sending        source/Symbol/SymbolContext.cpp
Sending        source/Symbol/Symtab.cpp
Sending        source/Symbol/Variable.cpp
Sending        source/Target/ThreadPlanStepOverRange.cpp
Transmitting file data .............................
Committed revision 241751.

After this you will need to modify the following function:

const ConstString&
Mangled::GetDemangledName (lldb::LanguageType language) const

To check if the language is pascal or java and do your fixups on it... You will need to store the pascal demanglings in a thread safe side table of ConstString to ConstString objects. Probably safe to just make a global table in Mangled.cpp that is protected by a mutex.

dawn abandoned this revision.Jul 9 2015, 4:08 PM

Thanks Greg! I'm in the middle of implementing a few other language related features but will get on this as soon as I can. FYI, the language feature I'm adding now is to use the language of the selected frame's CU for evaluation, but I'm finding that it breaks many ObjC tests.