This is an archive of the discontinued LLVM Phabricator instance.

[lldb] Refactor ObjCLanguage::MethodName
ClosedPublic

Authored by bulbazord on May 4 2023, 3:41 PM.

Details

Summary

The goal of this patch is to make it easier to reason about the state of
ObjCLanguage::MethodName. I do that in several ways:

  • Instead of using the constructor directly, you go through a factory method. It returns a std::optional<MethodName> so either you got an ObjCLanguage::MethodName or you didn't. No more checking if it's valid to know if you can use it or not.
  • ObjCLanguage::MethodName is now immutable. You cannot change its internals once it is created.
  • ObjCLanguage::MethodName::GetFullNameWithoutCategory previously had a parameter that let you get back an empty string if the method had no category. Every caller of this method was enabling this behavior so I dropped the parameter and made it the default behavior.
  • No longer store all the various components of the method name as ConstStrings. The relevant Get methods now return llvm::StringRefs backed by the MethodName's internal storage. The lifetime of these StringRefs are tied to the MethodName itself, so if you need to persist these you need to create copies.

Diff Detail

Event Timeline

bulbazord created this revision.May 4 2023, 3:41 PM
Herald added a project: Restricted Project. · View Herald Transcript
Herald added a subscriber: arphaman. · View Herald Transcript
bulbazord requested review of this revision.May 4 2023, 3:41 PM
Herald added a project: Restricted Project. · View Herald TranscriptMay 4 2023, 3:41 PM
jingham added a comment.EditedMay 9 2023, 4:49 PM

Most of this is fine. I wonder about avoiding caching the full name and name w/o category & selector name. One of the main uses of this class is to take incoming ObjC names from the ConstString pool, chop them up into full name, name w/o category, and selectorName, which we then insert into the lookup names tables for the modules they are found in. All those lookup table names will also exist in the ConstString pool.

The other instance where we get these names is when someone passes them to break set. In that case, we're going to break the name apart and then look it and its parts up in the ConstString lookup tables to find a match.

So it seems like this is a case where you aren't really getting savings in the ConstString pool, you're just causing more copying from std::string to ConstString.

lldb/source/Plugins/Language/ObjC/ObjCLanguage.h
37
100

What does this return if there's no category?

We have two behaviors above:

GetFullNameWithoutCategory - returns an empty string if the original class had no category
GetClassNameWithCategory - returns the original class name if it had no category
GetCategory - returns ??? if it had no category.

These seem a little skew somehow...

118–119

For the most part, we find ObjC method names either from debug info, from the symbol table or from the runtime symbol vendor. All of those are going to store their names in the ConstString pool.

If we only make these MethodName entities one by one then we're just doing a string copy every so often, and that shouldn't be noticeable. But if we make and hold a lot of these, we're doubling our storage requirements for a thing that at least on Darwin there are a lot of...

This one seems a reasonable candidate to be a ConstString...

Most of this is fine. I wonder about avoiding caching the full name and name w/o category & selector name. One of the main uses of this class is to take incoming ObjC names from the ConstString pool, chop them up into full name, name w/o category, and selectorName, which we then insert into the lookup names tables for the modules they are found in. All those lookup table names will also exist in the ConstString pool.

The full name being a ConstString might make sense since we're usually getting them from debug info or similar sources. When we chop them up, we can decide if it's worth making ConstStrings out of them. At least that's the idea.
That being said, I'll look in more detail about how often we actually try to create ObjCLanguage::MethodName objects. If we do it in a hot loop then maybe ConstString is the right thing to do.

The other instance where we get these names is when someone passes them to break set. In that case, we're going to break the name apart and then look it and its parts up in the ConstString lookup tables to find a match.

So it seems like this is a case where you aren't really getting savings in the ConstString pool, you're just causing more copying from std::string to ConstString.

What's the code path for this? I'm not 100% convinced we should be automatically storing portions of user input in the ConstString StringPool until we have a little more information about what they gave us.

lldb/source/Plugins/Language/ObjC/ObjCLanguage.h
37

Good catch, thanks!

100

GetCategory will return an empty StringRef if there is no category. I will explicitly document that.

I agree that there's some skew here though. The way I think about it is GetCategory and GetClassNameWithCategory do what they say they will do. If there's no category, both of these will leave out the category. The former will therefore be empty and the latter will give you what you want without the category. Maybe it could be empty though? I haven't thought about it.

118–119

Oh, I suppose that's true. I was looking at the call-sites and a lot of them were passing in const char *, but those likely come from ConstStrings at some layer. In that case, making m_full be a ConstString is probably fine.

BTW, from what I can tell, we only do make these ObjCLanguage::MethodName entities one-by-one. They are never really stored anywhere, they're mostly used to chop up things we think may be ObjCLanguage method names and get the parts out we find useful. Aside from the input, I wanted to create an interface where the caller can decide if the output of its functions should be stored in a ConstString or not instead of preemptively putting it in the StringPool.

Ok, I'm back with actual data on this patch. I have a test ObjC app with some UI, it loads a couple hundred modules. I place a breakpoint on main, I run until we hit main, and then I continue. My app will stop at -[NSBlockOperation main] after loading all the required modules. Here is the data without this patch applied:

% hyperfine -w 3 -- "$lldb -x -b -o 'b main' -o 'r' -o 'c' $App"
Benchmark 1: $lldb -x -b -o 'b main' -o 'r' -o 'c' $App
  Time (mean ± σ):      6.367 s ±  0.037 s    [User: 5.688 s, System: 0.295 s]
  Range (min … max):    6.312 s …  6.417 s    10 runs

...

  "memory": {
    "strings": {
      "bytesTotal": 133180053,
      "bytesUnused": 52050681,
      "bytesUsed": 81129372
    }
  },

Now, with the patch applied:

% hyperfine -w 3 -- "$lldb -x -b -o 'b main' -o 'r' -o 'c' $App"
Benchmark 1: $lldb -x -b -o 'b main' -o 'r' -o 'c' $App
  Time (mean ± σ):      6.209 s ±  0.046 s    [User: 5.554 s, System: 0.282 s]
  Range (min … max):    6.159 s …  6.295 s    10 runs

...

  "memory": {
    "strings": {
      "bytesTotal": 133175645,
      "bytesUnused": 52115081,
      "bytesUsed": 81060564
    }
  },

From this we can see that my patch actually is _slightly_ faster, saving a little more than 100ms on average. I ran hyperfine a few times under the same conditions and this is quite repeatable.
As for memory consumption, the ConstString StringPool is slightly smaller. The amount of bytes actually used in the StringPool shrank by about 60kb, and the total size of the StringPool shrank by about 5kb -- meaning we stored a lot less strings and grew the StringPool slightly less than before.

Now we had some discussion about ObjCLanguage::MethodName using a ConstString instead of a std::string to back its m_full member. I gathered the numbers on that one too:

% hyperfine -w 3 -- "$lldb -x -b -o 'b main' -o 'r' -o 'c' $App"
Benchmark 1: $lldb -x -b -o 'b main' -o 'r' -o 'c' $App
  Time (mean ± σ):      6.327 s ±  0.023 s    [User: 5.653 s, System: 0.296 s]
  Range (min … max):    6.299 s …  6.361 s    10 runs

...

  "memory": {
    "strings": {
      "bytesTotal": 133175645,
      "bytesUnused": 52115081,
      "bytesUsed": 81060564
    }
  },

From this, it looks like this is probably ever so slightly faster than with no patch, but still not as fast as using std::string. The memory consumption numbers are the exact same as using std::string. The conclusion I'm drawing from all the data here is that: 1.) My patch does decrease the size of the ConstString StringPool regardless of whether or not m_full is a std::string or a ConstString, and 2.) the std::string implementation is slightly faster on average than the ConstString one. I'll probably stick to this implementation since it appears to be faster on average (at least on my machine), but I'll update this PR to fix the documentation.

Update documentation for clarity

bulbazord marked 3 inline comments as done.May 11 2023, 12:46 PM

I'm not opposed to using this implementation, but have you considered using something like the stdlib regex library to do the heavy lifting?

I'm not opposed to using this implementation, but have you considered using something like the stdlib regex library to do the heavy lifting?

I had not considered it actually... I don't have any experience using anything from <regex>. I know LLVM has its own regex engine, perhaps that might be appropriate? I know LLDB uses it in its RegularExpression class.

Your test measured setting a found simple breakpoint. That should measure filling all the names caches - we do that the first time you try to set a breakpoint of any sort. But doesn't measure the effects on lookup. I am guessing you will find the same "not much difference" here as well, but it would be good to test that. So it would be good to also ensure you aren't slowing down looking for a selector by name, and looking for a selector you aren't going to find by name, and looking by full ObjC name. But if that's also true, then I'm fine with this.

Some more numbers:

% hyperfine -w 3 -- "$lldb -x -b -o 'b main' -o 'r' -o 'c' -o 'b $Selector' -o 'b $NonExistentSelector' -o 'b $FullObjCName' $App"
Benchmark 1: $lldb -x -b -o 'b main' -o 'r' -o 'c' -o 'b $Selector' -o 'b $NonExistentSelector' -o 'b $FullObjCName' $App
  Time (mean ± σ):      6.323 s ±  0.069 s    [User: 5.626 s, System: 0.301 s]
  Range (min … max):    6.224 s …  6.443 s    10 runs

% hyperfine -w 3 -- "$lldb -x -b -o 'b main' -o 'r' -o 'c' -o 'b $Selector' -o 'b $NonExistentSelector' -o 'b $FullObjCName' $App"
Benchmark 1: $lldb -x -b -o 'b main' -o 'r' -o 'c' -o 'b $Selector' -o 'b $NonExistentSelector' -o 'b $FullObjCName' $App
  Time (mean ± σ):      6.270 s ±  0.042 s    [User: 5.557 s, System: 0.313 s]
  Range (min … max):    6.210 s …  6.338 s    10 runs

I hope this is sufficient to show we're not regressing performance. I measured this a few times with and without my change and I observed that this is usually faster but they are usually within 100ms of each other.

LGTM, Adrian's comment is still outstanding however.

I'm not opposed to using this implementation, but have you considered using something like the stdlib regex library to do the heavy lifting?

I talked to Jonas and did a little research. It seems like <regex> is quite slow and many supported compilers have buggy implementations. If I were to use regular expressions to rewrite this, I'd likely use whatever LLVM has implemented. That being said, I'm not too keen on holding up this patch because of that.

aprantl accepted this revision.May 18 2023, 9:12 AM

One more comment inline but otherwise this is fine with me.

lldb/source/Plugins/Language/ObjC/ObjCLanguage.cpp
155

Using an llvm string stream to construct the string might be faster here.

This revision is now accepted and ready to land.May 18 2023, 9:12 AM
bulbazord marked an inline comment as done.May 18 2023, 2:47 PM
bulbazord added inline comments.
lldb/source/Plugins/Language/ObjC/ObjCLanguage.cpp
155

I tested this and found that llvm::SmallString<128> performs about as well as std::string on average. llvm::SmallString<64> was on average slower though.

I think I'm going to land this with std::string, we can revisit later if the difference actually is measurable.

This revision was automatically updated to reflect the committed changes.
bulbazord marked an inline comment as done.
aprantl added inline comments.May 18 2023, 3:04 PM
lldb/source/Plugins/Language/ObjC/ObjCLanguage.cpp
155

Maybe we were talking past each other: I was thinking about https://www.llvm.org/doxygen/classllvm_1_1raw__string__ostream.html

bulbazord added inline comments.May 18 2023, 3:06 PM
lldb/source/Plugins/Language/ObjC/ObjCLanguage.cpp
155

Oh, I totally misread what you said. I'll try that out and create a new PR if I can improve this.

bulbazord marked an inline comment as done.May 18 2023, 3:35 PM
bulbazord added inline comments.
lldb/source/Plugins/Language/ObjC/ObjCLanguage.cpp
155

Turns out it can be improved, but it wasn't with raw_string_ostream! https://reviews.llvm.org/D150914