This is an archive of the discontinued LLVM Phabricator instance.

Use rich mangling information in Symtab::InitNameIndexes()
ClosedPublic

Authored by sgraenitz on Jul 31 2018, 8:12 AM.

Details

Summary

I set up a new review, because not all the code I touched was marked as a change in old one anymore: https://reviews.llvm.org/D49990

In preparation for this review, there were two earlier ones:

Primary goals for this patch are:
(1) Use ItaniumPartialDemangler's rich mangling info for building LLDB's name index.
(2) Provide a uniform interface.
(3) Improve indexing performance.

The central implementation in this patch is our new function for explicit demangling:

const RichManglingInfo *
Mangled::DemangleWithRichManglingInfo(RichManglingContext &, SkipMangledNameFn *)

It takes a context object and a filter function and provides read-only access to the rich mangling info on success, or otherwise returns null. The two new classes are:

  • RichManglingInfo offers a uniform interface to query symbol properties like getFunctionDeclContextName() or isCtorOrDtor() that are forwarded to the respective provider internally (llvm::ItaniumPartialDemangler or lldb_private::CPlusPlusLanguage::MethodName).
  • RichManglingContext works a bit like LLVMContext, it the actual RichManglingInfo returned from DemangleWithRichManglingInfo() and handles lifetime and configuration. It is likely stack-allocated and can be reused for multiple queries during batch processing.

The idea here is that DemangleWithRichManglingInfo() acts like a gate keeper. It only provides access to RichManglingInfo on success, which in turn avoids the need to handle a NoInfo state in every single one of its getters. Having it stored within the context, avoids extra heap allocations and aids (3). As instantiations of the IPD the are considered expensive, the context is the ideal place to store it too. An efficient filtering function SkipMangledNameFn is another piece in the performance puzzle and it helps to mimic the original behavior of InitNameIndexes.

Future potential:

  • DemangleWithRichManglingInfo() is thread-safe, IFF using different contexts in different threads. This may be exploited in the future. (It's another thing that it has in common with LLVMContext.)
  • The old implementation only parsed and indexed Itanium mangled names. The new RichManglingInfo can be extended for various mangling schemes and languages.

One problem with the implementation of RichManglingInfo is the inaccessibility of class CPlusPlusLanguage::MethodName (defined in source/Plugins/Language/..), from within any header in the Core components of LLDB. The rather hacky solution is to store a type erased reference and cast it to the correct type on access in the cpp - see RichManglingInfo::get<ParserT>(). At the moment there seems to be no better way to do it. IMHO CPlusPlusLanguage::MethodName should be a top-level class in order to enable forward delcarations (but that is a rather big change I guess).

First simple profiling shows a good speedup. target create clang now takes 0.64s on average. Before the change I observed runtimes between 0.76s an 1.01s. This is still no bulletproof data (I only ran it on one machine!), but it's a promising indicator I think.

Event Timeline

sgraenitz created this revision.Jul 31 2018, 8:12 AM
sgraenitz edited the summary of this revision. (Show Details)Jul 31 2018, 8:45 AM

Thank you for updating the patch.

If I understand things correctly, we could avoid circular deps and untyped pointers (or llvm::Any, which is essentially the same thing), by moving CPlusPlusLanguage::MethodName to a separate file. Could we do that as a preparatory step for this patch?

source/Symbol/Symtab.cpp
222

make this static and drop the surrounding namespace.

labath added inline comments.Jul 31 2018, 10:06 AM
include/lldb/Core/RichManglingInfo.h
83–84 ↗(On Diff #158274)

I find it odd that one of these functions is stateless (SetLegacyCxxParserInfo takes the string which initializes the object as an argument), while the other receives the state magically via the IPD accessor.

Would it be possible to change it so that the other option is stateless too (SetItaniumInfo(ConstString mangled))?

We could change RichManglingInfo to return the full demangled name too, so that you have access to the demangled name via this API too (which /I think/ is the reason why you created a separate GetItaniumRichDemangleInfo function).

source/Core/RichManglingInfo.cpp
36–40 ↗(On Diff #158274)

Is this really the mangled name?

69–80 ↗(On Diff #158274)

Could these return StringRef? Am I correct in assuming that m_IPD_size holds the size of the returned string? If thats the case then you could even do this with no performance impact (or a positive one).

sgraenitz updated this revision to Diff 158638.Aug 1 2018, 2:26 PM

Minor improvements.

sgraenitz marked 2 inline comments as done.Aug 1 2018, 2:28 PM
sgraenitz added inline comments.
include/lldb/Core/RichManglingInfo.h
83–84 ↗(On Diff #158274)

Would it be possible to change it so that the other option is stateless too (SetItaniumInfo(ConstString mangled))?

I think that would be misleading for the reader. Symmetry is always nice to indicate congruent behavior, but this is not the case here. The fallback C++ parser version mimics the previous behavior: demangle the name in step 1 and if that succeeds, parse it with CPlusPlusLanguage::MethodName to decode its properties. The IPD does that in a single step and so there is nothing to do in SetItaniumInfo(). SetLegacyCxxParserInfo() on the other hand has to create a CPlusPlusLanguage::MethodName from the mangled string.

We could change RichManglingInfo to return the full demangled name too, so that you have access to the demangled name via this API too (which /I think/ is the reason why you created a separate GetItaniumRichDemangleInfo function).

The interface wants to provide access to the name's encoded properties. That doesn't really include the demangled "human readable" string representation (it's also not needed in Symtab::RegisterMangledNameEntry()).

Of course Symtab::InitNameIndexes() will ask for GetDemangledName() just after processing the mangled one. I think chances are good that for symbols that are not filtered out anyway, we reached DemangleWithRichManglingInfo() earlier and it will be in the string pool already (see Mangled.cpp:322). Otherwise, yes it will be demangled in with a new IPD instance.

I think it's not bad to keep these cases as they are. I would propose to investigate the performance of the current implementation in more detail and think about the benefits before mixing them. Not sure at all, if it would bring measurable gain.

source/Core/RichManglingInfo.cpp
36–40 ↗(On Diff #158274)

Yes, it is.

If I understand things correctly, we could avoid circular deps and untyped pointers (or llvm::Any, which is essentially the same thing), by moving CPlusPlusLanguage::MethodName to a separate file.

Banning public nested classes in general is a good practice as long as C++ is lacking rich forward declarations. This is the issue here. If we could forward declare the class, all was great. The problem is, of course, that there is a pattern behind Language::MethodName within the language plugins. So changing all of them? Hmm..

moving CPlusPlusLanguage::MethodName to a separate file.

You mean putting it in a separate file and include that one in the RichManglingInfo.h? I am not so familiar with the code yet, but it looks like there are reasons for having CPlusPlusLanguage under Plugins and separate these from other things like Core or Symbol. I don't think we should include a Plugins header from a Core header..

Could we do that as a preparatory step for this patch?

Well, now I spent the time to make the hack nice haha.
To be honest, I am happy to talk about a follow-up task here, but doing this now and before finishing the demangling sounds like a big unrelated piece of work.

labath added a comment.Aug 2 2018, 4:00 AM

If I understand things correctly, we could avoid circular deps and untyped pointers (or llvm::Any, which is essentially the same thing), by moving CPlusPlusLanguage::MethodName to a separate file.

Banning public nested classes in general is a good practice as long as C++ is lacking rich forward declarations. This is the issue here. If we could forward declare the class, all was great. The problem is, of course, that there is a pattern behind Language::MethodName within the language plugins. So changing all of them? Hmm..

moving CPlusPlusLanguage::MethodName to a separate file.

You mean putting it in a separate file and include that one in the RichManglingInfo.h? I am not so familiar with the code yet, but it looks like there are reasons for having CPlusPlusLanguage under Plugins and separate these from other things like Core or Symbol. I don't think we should include a Plugins header from a Core header..

I'd actually draw the line even higher, generic code shouldn't include Plugin code, ever (the SystemInitializer classes being the only exception, I guess). While having dependencies in the headers is more troublesome (e.g. for modules), it is not a line that can easily be held as refactorings (such as yours) which move functionality around (even with a single module) can easily cause a dep which was previously cpp-only to show up in the header too. However, it is true that we are very far from the line I wish to hold, while yours is almost there (I see only one exception where we include some clang stuff from ClangASTContext.h). I was not aware of this fact until now, so I do concede that it's good to not make this worse here.

Could we do that as a preparatory step for this patch?

Well, now I spent the time to make the hack nice haha.
To be honest, I am happy to talk about a follow-up task here, but doing this now and before finishing the demangling sounds like a big unrelated piece of work.

FWIW, I was not thinking of any major changes. Just moving the single c++ class to a separate file and that's it. We could even leave behind a typedef in CPlusPlusLanguage so that it's accessible under the old name too.

(However if you are interested in something like this, then it might be interesting to look at whether this MethodName stuff couldn't be properly pluginified. Something like where a Language class registers a callback for a specific mangling type, and then accessing that through this)

include/lldb/Core/RichManglingInfo.h
83–84 ↗(On Diff #158274)

I don't see how this would be misleading. Both versions provide a way to access the name's properties. The fact that they take slightly different inputs (one takes a mangled name, one demangled) makes them slightly different, but I don't find that too troublesome. We could encode that difference in the method name to make it more obvious. What troubles me more is that you have two-step initialization for the IPD case. It's always nice to avoid that, and that's particularly important when it's asymmetric with the other method.

Let me try to elaborate on how I'd do this. RichManglingContext (ps: maybe it's not correct to call this a *mangling* context, because the second method does not use mangled names. RichMethodContext ?) would have two methods:
fromItaniumName(ConstString ItaniumMangled) and fromCxxMethodName(ConstString Name). Both would return RichManglingInfo (RichMethodInfo ?) as they do now.

Then the usage would be something like:

case eManglingSchemeItanium:
  auto *info = context.fromItaniumName(m_mangled);
  m_demangled.SetCStringWithMangledCounterPart(info->GetFullName(), m_mangled);
  return info;

case eManglingSchemeMSVC:
  m_demangled.SetCStringWithMangledCounterpart(demangle_manually(m_mangled), m_mangled);
  return context.fromCxxMethodName(m_demangled);

The only change necessary for this is to include the full demangled name in the "name's encoded properties". While that may not be a *property* of the name by the strictest definition, I don't think it's much of a stretch to include it there, and it does not require making any compromises, as both methods already have access to this information.

So overall, I believe this should have no impact on performance, while improving the "symmetry" of the code (which is somewhat subjective, but I am hoping that we can agree that a having single method is better than having two methods which must be called in the right order). Is there anything I'm missing here?

source/Core/RichManglingInfo.cpp
36–40 ↗(On Diff #158274)

How is that possible? I see nothing in CPlusPlusLanguage::MethodName which would perform the demangling, and yet it clearly operates on demangled names. Also, the LHS of this patch creates MethodName instances like this
CPlusPlusLanguage::MethodName cxx_method(mangled.GetDemangledName(lldb::eLanguageTypeC_plus_plus));

(However if you are interested in something like this, then it might be interesting to look at whether this MethodName stuff couldn't be properly pluginified. Something like where a Language class registers a callback for a specific mangling type, and then accessing that through this)

Yes this sounds more like a plan. An isolated refactoring patch without would be appropriate I think. In contrast, making a quick fix just to avoid the llvm::Any seems less appealing to me.

include/lldb/Core/RichManglingInfo.h
83–84 ↗(On Diff #158274)

Ok, sounds reasonable. Let's try that. I will prepare a patch.

source/Core/RichManglingInfo.cpp
36–40 ↗(On Diff #158274)

Oh right, I should revisit that.

While we are here: It's a pity that only MSVC-mangled names trigger the fallback case with the new implementation. Thus, my unit test has no coverage for it and I didn't recognize the mistake. I thought about integrating ObjCLanguage parsing here too. On the one hand this would improve test coverage. On the other hand it would cause a semantic change in name indexing (and my ObjC knowledge is quite limited).

What do you think?

labath added a comment.Aug 3 2018, 3:03 AM

It sounds like we have a plan then! Look forward to seeing the results.

source/Core/RichManglingInfo.cpp
36–40 ↗(On Diff #158274)

I understand Zachary is working on an MSVC demangler as we speak. When that's finished, we should be able to add a couple of MSVC mangled names to your test case. (The MethodName class itself is tested elsewhere, so here we just need to test that things are wired up correctly.)

As for the ObjC idea, I am not sure what exactly would that entail, so I don't have an opinion on it. But I can certainly believe that there is a lot of room for improvement here.

sgraenitz added inline comments.Aug 3 2018, 3:08 AM
include/lldb/Core/RichManglingInfo.h
36 ↗(On Diff #158638)

I realized that returning llvm::StringRef from here may not be the best idea. To the outside it appears to be an immutable string, but internally it will be reused and even realloc'ed. I guess returning a ConstString is not acceptable as an alternative performance-wise, as it adds one memcpy per query. I will sketch something here, but it's not easy as it needs to fit both, the IPD and C++ MethodName.

61 ↗(On Diff #158638)

This must be llvm::any_isa<ParserT *>. Found during fallback-test and fixed.

source/Core/RichManglingInfo.cpp
36–40 ↗(On Diff #158274)

It's actually only a few lines to hack, so that the unit test covers the fallback case. I did it once now and found the llvm::any_isa<ParserT> bug. Otherwise it passes.

labath added inline comments.Aug 3 2018, 3:20 AM
include/lldb/Core/RichManglingInfo.h
36 ↗(On Diff #158638)

That is a bit unfortunate, but I don't think things would be any better if you left this as const char *. AFAICT, the caller converts this to a ConstString anyway, so returning that type might not be such a bad choice, but I would be fine with just documenting that the returned string is valid until the next GetXXX call.

source/Core/RichManglingInfo.cpp
93 ↗(On Diff #158638)

These .data() calls can now be removed.

source/Symbol/Symtab.cpp
381

remove .data()

sgraenitz added inline comments.Aug 3 2018, 8:33 AM
source/Core/RichManglingInfo.cpp
69–80 ↗(On Diff #158274)

While investigating the StringRef issue, I realized another problem here. The second param N to the IPD queries is used to pass the buffer size in and the string length out. With the current use of m_IPD_size I pass ever smaller buffer sizes in subsequent calls. That causes unnecessary reallocs. I should track the length out param separately.

source/Symbol/Symtab.cpp
381

Actually, this is dangerous as StringRef doesn't guarantee null termination. Will fix that.

sgraenitz updated this revision to Diff 159060.Aug 3 2018, 11:21 AM

Merge RichManglingInfo into RichManglingContext in order to reduce complexity when avoiding unnecessary reallocs.

I quickly realized, that it needs a bigger change to properly address the aforementioned issues. Here's the major changes. Some documentation is todo.

StringRef to reused buffer addressed with a split into first parse then get buffer. It seems more natural that the buffer may be reused. Also we can ask for const char * or a StringRef, which avoids the conversion issue.

MC.ParseFunctionBaseName();
llvm::StringRef base_name = MC.GetBufferRef();
[...]
MC.ParseFunctionDeclContextName();
const char *decl_context_cstr = MC.GetBufferAsCString();

buffer size in, string length out on success, buffer size out on fail is addressed in processIPDStrResult. It handles fails, reallocs and the regular cases and now distinguishes between m_IPD_buf_size and m_IPD_str_len.RichManglingContext maintains a single buffer now that is reused for all string-queries for Itanium. The fallback case always uses ConstString now and it may be a bit slower than before, but IMHO that's fine.

This change made the separation between RichManglingInfo and RichManglingContext unsustainable. So I merged them and now there are these ugly None cases, but otherwise it looks ok. I didn't adjust class and file names yet.

I touched the bookkeeping in Symtab::InitNamesIndexes() and added Symtab::RegisterBacklogEntry().

I did some benchmarking. The LLDB integrated timers add so much overhead that the numbers are not really helpful. Xcode Instruments gave good results (I can do it with the old impl on Monday). This is from lldb -b -- /path-to-ninja-build/relwithdebinfo/bin/clang:

16.00 ms   2.7%  Symtab::PreloadSymbols()
16.00 ms   2.7%    Symtab::InitNameIndexes()
 7.00 ms   1.1%      Mangled::DemangleWithRichManglingInfo(RichManglingConte...
 3.00 ms   0.5%        ConstString::SetCStringWithMangledCounterpart(char co...
 3.00 ms   0.5%        Pool::GetConstCStringAndSetMangledCounterPart(char co...
 2.00 ms   0.3%        RichManglingContext::ParseFullName() const
 1.00 ms   0.1%        lldb_skip_name(llvm::StringRef, Mangled::ManglingScheme)
 1.00 ms   0.1%        RichManglingContext::FromItaniumName(ConstString const&)
 4.00 ms   0.6%      Mangled::GuessLanguage() const
 4.00 ms   0.6%        Mangled::GetDemangledName(lldb::LanguageType) const
 2.00 ms   0.3%      Symtab::RegisterMangledNameEntry(UniqueCStringMap<unsig...
source/Core/RichManglingInfo.cpp
126 ↗(On Diff #159060)

Not sure about multi_in_out as a name here. It's hard to find one that fits the purpose and stays below 30 chars.. IPD calls it N.

labath added a comment.Aug 6 2018, 2:04 AM

I think we're getting there, but all these changes mean I have another round of comments. None of them should be major though.

I'm also wondering whether it wouldn't be good to add a couple (one for each mangling scheme?) unit tests for the new class, as a way to show the way the API is supposed to be used (without all of the Symtab stuff floating around).

I quickly realized, that it needs a bigger change to properly address the aforementioned issues. Here's the major changes. Some documentation is todo.

StringRef to reused buffer addressed with a split into first parse then get buffer. It seems more natural that the buffer may be reused. Also we can ask for const char * or a StringRef, which avoids the conversion issue.

MC.ParseFunctionBaseName();
llvm::StringRef base_name = MC.GetBufferRef();

I like this.

The fallback case always uses ConstString now and it may be a bit slower than before, but IMHO that's fine.

I agree it wouldn't be a disaster, but it also wasn't hard to fix things so this works correctly without C strings. So, I just went ahead and committed r338995, which should allow you to avoid messing with C strings in this class completely.

include/lldb/Core/RichManglingInfo.h
59 ↗(On Diff #159066)

Is this the right assertion? As far as I can tell, you initialize m_IPD_buf in the constructor unconditionally.

source/Core/Mangled.cpp
322–323

C string no longer needed here.

source/Core/RichManglingInfo.cpp
126 ↗(On Diff #159060)

The name makes no difference to me. However, if you think it makes sense, you could try passing getFunctionBaseName as a member function pointer into the helper function, so that all of the buffer handling is completely hidden inside that function.

47 ↗(On Diff #159066)

LLDB_LOG(log, demangled itanium: {0} -> \"{1}\"", mangled, GetBuffer());
shorter and avoids C strings

72 ↗(On Diff #159066)

will base_name always be non-empty (e.g. if the plugin failed to parse the string?). base_name.startswith('~') might be more appropriate here.

source/Symbol/Symtab.cpp
386–387

MC.GetBuffer().empty()

Ok I will fix the details, rebase and update the review.

Btw: I also spilt off the ConstString::IsNull() addition with unit test and fix in Mangled to https://reviews.llvm.org/D50327

source/Core/RichManglingInfo.cpp
126 ↗(On Diff #159060)

Thought about that, but didn't want to over-engineer this. It might be revisited in case we add more accessors in the future.

sgraenitz updated this revision to Diff 159556.Aug 7 2018, 11:21 AM
sgraenitz marked 5 inline comments as done.

Many small things

I'm also wondering whether it wouldn't be good to add a couple unit tests for the new class, as a way to show the way the API is supposed to be used

This is still missing, but working on it.

labath added inline comments.Aug 7 2018, 12:48 PM
source/Core/RichManglingContext.cpp
134

I thought we were going to get rid of this ConstString? The C++ plugin is able to provide the same (in fact, even stronger) lifetime guarantees as the IPD, so we could just have a single method that always returns a StringRef. (If you really want to, you can also have a ConstString-returning helper function, but it could be implemented the same way as it is now for IPD, so that both versions have the same runtime complexity.

I'd suggest having just a m_buffer StringRef object, which you set both here and in processIPDStrResult. Then GetBufferRef() can just return that object without any branching involved.

sgraenitz updated this revision to Diff 159654.Aug 8 2018, 1:58 AM

Add unit tests for RichManglingContext

sgraenitz added inline comments.Aug 8 2018, 2:19 AM
source/Core/RichManglingContext.cpp
134

Yes, I can change the ConstString thing here.

I'd suggest having just a m_buffer StringRef object

You mean a single buffer for both, IPD and C++ method parser plugin? Actually I don't like to mix that. In case of IPD the buffer is owned by the context. Otherwise it's owned by the C++ method parser plugin. I could manage that in FromItaniumName and FromCxxMethodName, but it will make everything more complicated. Top prio for me is to avoid reallocation of the relatively big initial m_IPD_buf and I think the cleanest way to do this is the RAII-style that's used already.

labath added inline comments.Aug 8 2018, 2:53 AM
source/Core/RichManglingContext.cpp
134

Well, kind of. I guess I shouldn't have said /just/ an m_buffer object. For the itanium case, this buffer would be /in addition/ to the existing m_PID_XXX members.

The operating invariant would be that m_buffer is a non-owning reference to the string which was set by the last ParseXXX call. So, for the CPlusPlus case it means you would here to something like m_buffer = get<CPlusPlusLanguage::MethodName>(m_cxx_method_parser)->GetBasename();. And in the itanium case, you would set m_buffer = StringRef(m_IPD_buf, m_IPD_str_size), after you finish all other work.

I don't think that should affect any of the work you do already, it should just make the GetBufferRef function simpler (and faster).

sgraenitz updated this revision to Diff 159678.Aug 8 2018, 3:53 AM

Polishing

sgraenitz marked an inline comment as done.Aug 8 2018, 3:58 AM

Tests, polishing and renaming (as discussed with @jingham).

If you have more comments on this, please let me know.
Otherwise I'd be fine with keeping it like this.

labath accepted this revision.Aug 8 2018, 6:28 AM

I think this looks great. I am very happy with the final result. Thank you for your patience, and thanks again for taking on this task.

This revision is now accepted and ready to land.Aug 8 2018, 6:28 AM

You and Pavel have done great work getting this all straight. I don't have much to add except some trivial quibbles about variable naming, a grammar error and some other inessentials.

include/lldb/Core/Mangled.h
323–325

The return description isn't right anymore.

include/lldb/Core/RichManglingContext.h
99

had -> would have

source/Core/Mangled.cpp
239–240

The LLVM conventions suggest only putting declarations in the anon namespace, and then having the definitions out of line:

http://www.llvm.org/docs/CodingStandards.html#anonymous-namespaces

305–306

locals in lldb are lower case always, and we really try not to use one letter variables (M & S below...)

source/Core/RichManglingContext.cpp
148

Again, don't use capitol letter local variables

source/Symbol/Symtab.cpp
376–377

Not MC either...

sgraenitz updated this revision to Diff 159767.Aug 8 2018, 11:44 AM
sgraenitz marked 5 inline comments as done.

Addressed Jim's feedback

sgraenitz marked an inline comment as done.Aug 8 2018, 11:49 AM
sgraenitz added inline comments.
source/Core/Mangled.cpp
239–240

Right. Well that's a C function, so removed anon namespace and declared static.

jingham accepted this revision.Aug 8 2018, 2:25 PM

Looks good.

This revision was automatically updated to reflect the committed changes.