This is an archive of the discontinued LLVM Phabricator instance.

[lldb] Add a setting to skip long mangled names
AbandonedPublic

Authored by JDevlieghere on Feb 2 2022, 9:50 AM.

Details

Summary

Libraries which rely heavily on templates (e.g. boost) can generate extremely long symbol names, with mangled names in the 10.000 of characters. These symbols take a long time to demangle and can results in unmangled names that are several megabytes in size. This patch adds a setting to skip past these symbols when indexing the symbol table to speed up launch/attach times and keep memory usage in check.

I arbitrarily picked 1000 as the default value which seems large enough to not affect most workflows.

Diff Detail

Event Timeline

JDevlieghere requested review of this revision.Feb 2 2022, 9:50 AM
JDevlieghere created this revision.

Any chance you might want a limit on the size of the demangled name too? (might be worth considering what the most densely encoded mangled name is (ie: what's the longest name that could be produced by a 10k long mangled name? and see if that's worth having another cutoff for)

Is it worth trying to come up with a limit that's not arbitrarily picked?

lldb/source/Symbol/Symtab.cpp
300

technically demangling_limit

jingham added a comment.EditedFeb 2 2022, 10:27 AM

Any chance you might want a limit on the size of the demangled name too? (might be worth considering what the most densely encoded mangled name is (ie: what's the longest name that could be produced by a 10k long mangled name? and see if that's worth having another cutoff for)

Ironically, lldb seldom cares about most of the goo in these long demangled names. At this point, we are building up our fast-lookup "name indexes". We really only care about extracting the fully scoped names of the methods. When we get around to doing smart matching on overloads, we can still pull out all the matches to the method name, and then do the overload match on the results. That should be sufficiently efficient, and obviate the need to do any fancy indexing based on overloads. So most of the work of demangling these names is not being used anyway.

So what would be the better solution for lldb on the demangling front would be a way to tell the demangler "only extract the full method name, and don't bother producing the argument list or return values". But I have no idea how easy that would be in the demangler.

Note, I am not suggesting this as a replacement for this patch, since this fixes some really silly cases that template abuse has inflicted on us. This is a longer term suggestion.

JDevlieghere edited the summary of this revision. (Show Details)Feb 2 2022, 10:51 AM
JDevlieghere edited the summary of this revision. (Show Details)

Any chance you might want a limit on the size of the demangled name too? (might be worth considering what the most densely encoded mangled name is (ie: what's the longest name that could be produced by a 10k long mangled name? and see if that's worth having another cutoff for)

Ironically, lldb seldom cares about most of the goo in these long demangled names. At this point, we are building up our fast-lookup "name indexes". We really only care about extracting the fully scoped names of the methods. When we get around to doing smart matching on overloads, we can still pull out all the matches to the method name, and then do the overload match on the results. That should be sufficiently efficient, and obviate the need to do any fancy indexing based on overloads. So most of the work of demangling these names is not being used anyway.

So what would be the better solution for lldb on the demangling front would be a way to tell the demangler "only extract the full method name, and don't bother producing the argument list or return values". But I have no idea how easy that would be in the demangler.

I think there's an API level of the demangler in LLVM designed for rewriting demangled names (@rsmith created/implemented that, I think) - I'm not sure if it's structured to allow lazy parsing/stopping after you get the base name, for instance, but maybe...

Can we put a limit on the length of the kinds of names we are willing to demangle in the first place? How long are some of these names _prior_ to demangling? It would be great if we could skip demangling names that are too long to begin with. That would allow us to skip trying to create the demangled name in the first place which is part of the memory problem right? Once the mangled name has been added to the ConstString pool it is already too late and we won't save any memory.

Can we put a limit on the length of the kinds of names we are willing to demangle in the first place? How long are some of these names _prior_ to demangling? It would be great if we could skip demangling names that are too long to begin with. That would allow us to skip trying to create the demangled name in the first place which is part of the memory problem right? Once the mangled name has been added to the ConstString pool it is already too late and we won't save any memory.

Yep, that's exactly what this patch does: it checks the length of the mangled name and skips demangling if they are too long. It only does this while building the index though, so that if we need the demangled name elsewere we'll still cache it in the string pool.

Any chance you might want a limit on the size of the demangled name too? (might be worth considering what the most densely encoded mangled name is (ie: what's the longest name that could be produced by a 10k long mangled name? and see if that's worth having another cutoff for)

Ironically, lldb seldom cares about most of the goo in these long demangled names. At this point, we are building up our fast-lookup "name indexes". We really only care about extracting the fully scoped names of the methods. When we get around to doing smart matching on overloads, we can still pull out all the matches to the method name, and then do the overload match on the results. That should be sufficiently efficient, and obviate the need to do any fancy indexing based on overloads. So most of the work of demangling these names is not being used anyway.

So what would be the better solution for lldb on the demangling front would be a way to tell the demangler "only extract the full method name, and don't bother producing the argument list or return values". But I have no idea how easy that would be in the demangler.

I think there's an API level of the demangler in LLVM designed for rewriting demangled names (@rsmith created/implemented that, I think) - I'm not sure if it's structured to allow lazy parsing/stopping after you get the base name, for instance, but maybe...

We should definitely look into that as a general optimization for indexing the string table and would make sense in combination with D118814. For this particular patch, we're trying to avoid demangling at all if the symbol is too long, so unless a partial demangle is really cheap (it might be) we'd still want to exclude symbols based on their mangled length.

Can we put a limit on the length of the kinds of names we are willing to demangle in the first place? How long are some of these names _prior_ to demangling? It would be great if we could skip demangling names that are too long to begin with. That would allow us to skip trying to create the demangled name in the first place which is part of the memory problem right? Once the mangled name has been added to the ConstString pool it is already too late and we won't save any memory.

Yep, that's exactly what this patch does: it checks the length of the mangled name and skips demangling if they are too long. It only does this while building the index though, so that if we need the demangled name elsewere we'll still cache it in the string pool.

Yikes sorry, I read line Symtab.cpp:311 as if it was demangling the name first and then checking _its_ length...

The settings leads one to believe that we don't ever demangle names longer than X characters, but it really means "don't auto demangle when indexing the symbol table". Should we actually stop the names from ever being demangled in "Mangled::GetDemangledName()" to force us to never try and demangle these kinds of names ever? We would need to have the setting modify a static value in that Mangled class when the setting gets changed so that the Mangled::GetDemangledName() function wouldn't have to try and access the settings each time someone wanted to demangle something. Then we will still display the mangled name in stack traces, but we won't ever waste time trying to demangle these names. When they are that long, they are really not useful at all for me at least.

Yikes sorry, I read line Symtab.cpp:311 as if it was demangling the name first and then checking _its_ length...

The settings leads one to believe that we don't ever demangle names longer than X characters, but it really means "don't auto demangle when indexing the symbol table". Should we actually stop the names from ever being demangled in "Mangled::GetDemangledName()" to force us to never try and demangle these kinds of names ever? We would need to have the setting modify a static value in that Mangled class when the setting gets changed so that the Mangled::GetDemangledName() function wouldn't have to try and access the settings each time someone wanted to demangle something. Then we will still display the mangled name in stack traces, but we won't ever waste time trying to demangle these names. When they are that long, they are really not useful at all for me at least.

Yeah, I considered that but I noticed some uses of the demangled names in the language plugins and I wasn't sure if that was going to break. If we think that it never makes sense to demangle these name, I'm happy to move it into Demangled and make it so that we never demangle these.

jingham added a comment.EditedFeb 2 2022, 4:14 PM

If we can distinguish between "when we handle all mangled names" and "when we handle one" I think we should continue to demangle names in the "when we handle one" case, since you never know when somebody really might need to look at the whole name and that won't be super expensive even in the worst case. But OTOH that probably happens seldom, so if there's no good way to distinguish, I'm also fine with never demangling over some limit.

labath added a comment.Feb 3 2022, 1:24 AM

Any chance you might want a limit on the size of the demangled name too? (might be worth considering what the most densely encoded mangled name is (ie: what's the longest name that could be produced by a 10k long mangled name? and see if that's worth having another cutoff for)

Ironically, lldb seldom cares about most of the goo in these long demangled names. At this point, we are building up our fast-lookup "name indexes". We really only care about extracting the fully scoped names of the methods. When we get around to doing smart matching on overloads, we can still pull out all the matches to the method name, and then do the overload match on the results. That should be sufficiently efficient, and obviate the need to do any fancy indexing based on overloads. So most of the work of demangling these names is not being used anyway.

So what would be the better solution for lldb on the demangling front would be a way to tell the demangler "only extract the full method name, and don't bother producing the argument list or return values". But I have no idea how easy that would be in the demangler.

I think there's an API level of the demangler in LLVM designed for rewriting demangled names (@rsmith created/implemented that, I think) - I'm not sure if it's structured to allow lazy parsing/stopping after you get the base name, for instance, but maybe...

We should definitely look into that as a general optimization for indexing the string table and would make sense in combination with D118814. For this particular patch, we're trying to avoid demangling at all if the symbol is too long, so unless a partial demangle is really cheap (it might be) we'd still want to exclude symbols based on their mangled length.

The most expensive step in demangling is the actual construction of the demangled string. It's fairly easy to make that exponential (because the the output string can be exponentially larger than the input). The construction of AST (well, a kind of a DAG actually), should always be linear.

And extracting the name this way will also save us from having to another parse of the demangled name (to extract the base name), so it's double goodness. I don't think the actual extraction should be that hard. The trickiest part is understanding the way in which the name are encoded so that you know what to look for.

labath added a comment.Feb 3 2022, 1:45 AM

(ie: what's the longest name that could be produced by a 10k long mangled name? and see if that's worth having another cutoff for)

I can create a 1MB demangled name from a string of just 123 bytes. By my estimate, a 1000-char mangled name could produce a demangled name whose length would have 43 digits. 10k mangled => 10^430 demanged, etc.

labath added a comment.Feb 3 2022, 2:32 AM

And extracting the name this way will also save us from having to another parse of the demangled name (to extract the base name), so it's double goodness. I don't think the actual extraction should be that hard.

Actually, I now see that we already have code which does just that. The mangled.DemangleWithRichManglingInfo call will use the "partial" demangler to extract the interesting pieces of the mangled name. It will also save that demangled name, but it only does that to avoid another demangling operation. We could easily make it skip that step. And I see that the other patch does just that...

So I guess my question is: what does this patch buy us vs. just doing the second patch alone? It seems like it would be nice to be able to let the user break do break set -n foo and have it stop on Something<RidiculouslyLongTemplate<...>>::foo..

Is it maybe because then the Something<RidiculouslyLongTemplate<...>> will be stored as a part foos "context" ? If so, maybe we could have it avoid storing the context instead? Or even better: store a simplified version of the scope with template gunk above some level removed?

JDevlieghere abandoned this revision.Feb 3 2022, 8:41 AM

Thanks Pavel. I should've dug a little deeper in the rich mangling context. After giving it a second look I think you're absolutely right and we can drop this patch in favor of D118814.