This is an archive of the discontinued LLVM Phabricator instance.

Allow direct comparison of ConstString against StringRef
ClosedPublic

Authored by teemperor on Apr 14 2019, 6:28 AM.

Details

Summary

When we want to compare a ConstString against a string literal (or any other non-ConstString),
we currently have to explicitly turn the other string into a ConstString. This makes sense as
comparing ConstStrings against each other is only a fast pointer comparison.

However, currently we (rather incorrectly) use in several places in LLDB temporary ConstStrings when
we just want to compare a given ConstString against a hardcoded value, for example like this:

if (extension != ConstString(".oat") && extension != ConstString(".odex"))

Obviously this kind of defeats the point of ConstStrings. In the comparison above we would
construct two temporary ConstStrings every time we hit the given code. Constructing a
ConstString is relatively expensive: we need to go to the StringPool, take a read and possibly
an exclusive write-lock and then look up our temporary string in the string map of the pool.
So we do a lot of heavy work for essentially just comparing a <6 characters in two strings.

I initially wanted to just fix these issues by turning the temporary ConstString in static variables/
members, but that made the code much less readable. Instead I propose to add a new overload
for the ConstString comparison operator that takes a StringRef. This comparison operator directly
compares the ConstString content against the given StringRef without turning the StringRef into
a ConstString.

This means that the example above can look like this now:

if (extension != ".oat" && extension != ".odex")

It also no longer has to unlock/lock two locks and call multiple functions in other TUs for constructing
the temporary ConstString instances. Instead this should end up just being a direct string comparison
of the two given strings on most compilers.

This patch also directly updates all uses of temporary and short ConstStrings in LLDB to use this new
comparison operator. It also adds a some unit tests for the new and old comparison operator.

Diff Detail

Repository
rL LLVM

Event Timeline

teemperor created this revision.Apr 14 2019, 6:28 AM
JDevlieghere added a subscriber: JDevlieghere.

Thanks for doing this, Raphael!

I wonder if there's situations where this could be confusing/not what you'd expect, but I can't immediately think of anything. If that's the case we can always make the operator explicit, so this LGTM.

This revision is now accepted and ready to land.Apr 14 2019, 9:55 AM

I, too, have some concern that this could have unintended side effects. To make the temporary StringRefs from the zero-terminated strings requires a strlen call each time. So we're making two passes over a string each time (once to measure and once to compare). Granted, these are mostly very short.

Probably not (yet) practical, but I wonder if ConstString::ConstString(const char *) could be constexpr in the future or if the side effect of manipulating the string pool would always prohibit that.

I'd love to see the static const variables for the short constant strings, just to weigh readability hit.

lldb/include/lldb/Utility/ConstString.h
173 ↗(On Diff #195065)

This is very clever way to express the constraint, but it's quite a bit of cognitive load for the reader to figure out what's being tested here. The comment sort of explains, but leaves it up to the reader to see how that maps to the condition. (It also leaves me wondering whether it's useful to have ConstString treat empty strings and nullptr as distinct things.)

A simpler way to communicate the condition to humans might be:

if (m_string == nullptr && rhs.data() != nullptr) return false;
if (m_string != nullptr && rhs.data() == nullptr) return false;

The clever expression requires the reader to know or deduce:

  1. that m_string and rhs.data() are pointers,
  2. that !p is equivalent to p != 0 and that the 0 will be implicitly converted to nullptr, yielding a bool that's true if p is a nullptr,
  3. that ^ is bitwise exclusive or,
  4. that boolean ! and bitwise ^ have the appropriate operator precedence, which is not always obvious when mixing types like this,
  5. that true and false are guaranteed to have values such that bitwise ^ will do the expected thing to the promoted types,
  6. and that the resulting int will be implicitly compared for inequality to 0.

Most of these are reasonable things to expect a C++ programmer to know, but expecting them to apply all of that knowledge (correctly) to figure out what the expression does seems like an unnecessarily high cognitive burden.

lldb/source/Core/Disassembler.cpp
1407 ↗(On Diff #195065)

This is _probably_ a performance win here, but it's not obvious. If the resulting lambda is called many times, and it actually captured ConstString(info.name) and ConstString(info.alt_name) rather than info.name and info.alt_name, then we're trying off two pointer comparisons for two StringRef comparisons.

teemperor marked 2 inline comments as done.Apr 16 2019, 12:42 AM

I, too, have some concern that this could have unintended side effects. To make the temporary StringRefs from the zero-terminated strings requires a strlen call each time. So we're making two passes over a string each time (once to measure and once to compare). Granted, these are mostly very short.

I think I'm not understanding something here. There is one strlen call to make the StringRef from the literal, but I don't see why we need a second one to compare (as StringRef and ConstString both store the string length and don't recompute it when comparing). Simple godbolt test case here: https://godbolt.org/z/9X1RRt So both the old and new version both require one strlen call from what I can see (which is anyway optimized out for comparing against literals).

I'd love to see the static const variables for the short constant strings, just to weigh readability hit.

I can maybe make a new revision to show this, but I hope it's obvious from looking at the code that I changed in this revision. Also there are the problems/style difficulties that hopefully don't need a full revision to see:

  1. We preferable want to have variables with the same name as the string content, but often the strings contain reserved words or words that don't translate into our lower_case_code_style so that's not possible. E.g. var == this_ is not really as readable as var == "this" (and it looks like a typo). Same with stuff like if (arg_type.GetTypeName() == "bool") vs if (arg_type.GetTypeName() == bool_) or image_infos[i].segments[k].name == ConstString("__TEXT")) vs image_infos[i].segments[k].name == text).
  2. We get a bunch of local variables that just bloat the code:
if (descriptor->IsCFType()) {
  ConstString type_name(valobj.GetTypeName());
  if (type_name == "__CFMutableBitVector" || type_name == "__CFBitVector" ||
      type_name == "CFMutableBitVectorRef" || type_name == "CFBitVectorRef") {


    if (valobj.IsPointerType())
      is_type_ok = true;
  }
}

becomes this

if (descriptor->IsCFType()) {
  ConstString type_name(valobj.GetTypeName());
  static const ConstString cf_mutable_bit_vector("__CFMutableBitVector"),
                           cf_bit_vector("__CFBitVector"),
                           cf_mutable_bit_vector_ref("CFMutableBitVectorRef"),
                           cf_bit_vector_ref("CFBitVectorRef")
  if (type_name == cf_mutable_bit_vector || type_name == cf_bit_vector ||
      type_name == cf_mutable_bit_vector_ref || type_name == cf_bit_vector_ref) {
    if (valobj.IsPointerType())
      is_type_ok = true;
  }
}
  1. Code becomes much harder to grep. E.g. if I want to check for the location where we filter the this variable from the local variables, I can currently do git grep "\"this\"" | grep == and directly see where we do the comparison. But with local variables I have to do some funky grep with context filtering. In general grepping for these string literals without a few lines of context will be pointless with ConstString variables for literals.
lldb/include/lldb/Utility/ConstString.h
173 ↗(On Diff #195065)

Yeah, that was rather quickly typed down when making this patch. Replaced with your version, thanks!

Also, I don't like the whole nullptr/empty thing in ConstString, but I didn't want to change this behavior as a side effect of this patch.

lldb/source/Core/Disassembler.cpp
1407 ↗(On Diff #195065)

Thanks!

teemperor removed a reviewer: espindola.
  • Made empty/nullptr check more readable.
  • Removed some uses of the new comparison operator for cases where we don't have a literal to compare against (which makes it hard to estimate if this is really better).
  • Reworded documentation so that the length of the string doesn't really matter getting a performance benefit, only the fact that whether ConstString we pass is temporary or not.
amccarth accepted this revision.Apr 17 2019, 8:36 AM

I, too, have some concern that this could have unintended side effects. To make the temporary StringRefs from the zero-terminated strings requires a strlen call each time. So we're making two passes over a string each time (once to measure and once to compare). Granted, these are mostly very short.

I think I'm not understanding something here. There is one strlen call to make the StringRef from the literal, but I don't see why we need a second one to compare (as StringRef and ConstString both store the string length and don't recompute it when comparing). Simple godbolt test case here: https://godbolt.org/z/9X1RRt So both the old and new version both require one strlen call from what I can see (which is anyway optimized out for comparing against literals).

I didn't mean to imply that strlen is called twice. My point was that there are two linear passes over the string literal: The first is the call to strlen to make the StringRef, and the second is the actual string comparison.

I'd love to see the static const variables for the short constant strings, just to weigh readability hit.

I can maybe make a new revision to show this, but I hope it's obvious from looking at the code that I changed in this revision. Also there are the problems/style difficulties that hopefully don't need a full revision to see:

  1. We preferable want to have variables with the same name as the string content, but often the strings contain reserved words or words that don't translate into our lower_case_code_style so that's not possible. E.g. var == this_ is not really as readable as var == "this" (and it looks like a typo). Same with stuff like if (arg_type.GetTypeName() == "bool") vs if (arg_type.GetTypeName() == bool_) or image_infos[i].segments[k].name == ConstString("__TEXT")) vs image_infos[i].segments[k].name == text).
  2. We get a bunch of local variables that just bloat the code:
if (descriptor->IsCFType()) {
  ConstString type_name(valobj.GetTypeName());
  if (type_name == "__CFMutableBitVector" || type_name == "__CFBitVector" ||
      type_name == "CFMutableBitVectorRef" || type_name == "CFBitVectorRef") {


    if (valobj.IsPointerType())
      is_type_ok = true;
  }
}

becomes this

if (descriptor->IsCFType()) {
  ConstString type_name(valobj.GetTypeName());
  static const ConstString cf_mutable_bit_vector("__CFMutableBitVector"),
                           cf_bit_vector("__CFBitVector"),
                           cf_mutable_bit_vector_ref("CFMutableBitVectorRef"),
                           cf_bit_vector_ref("CFBitVectorRef")
  if (type_name == cf_mutable_bit_vector || type_name == cf_bit_vector ||
      type_name == cf_mutable_bit_vector_ref || type_name == cf_bit_vector_ref) {
    if (valobj.IsPointerType())
      is_type_ok = true;
  }
}
  1. Code becomes much harder to grep. E.g. if I want to check for the location where we filter the this variable from the local variables, I can currently do git grep "\"this\"" | grep == and directly see where we do the comparison. But with local variables I have to do some funky grep with context filtering. In general grepping for these string literals without a few lines of context will be pointless with ConstString variables for literals.

Fair enough. Thanks for the details.

teemperor updated this revision to Diff 196732.Apr 25 2019, 2:53 PM
  • Removed the remaining uses of the new operator for comparisons against values that are not literals.

I didn't mean to imply that strlen is called twice. My point was that there are two linear passes over the string literal: The first is the call to strlen to make the StringRef, and the second is the actual string comparison.

Thanks for the explanation. I reverted the changes that introduce the new operator for non-literal values and I'll see if I can replace those by making ConstString variables. So this patch now only introduces comparisons against literals where the strlen call can be optimized out (which means we only do a direct size cmp for both strings and then a strcmp).

This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project. · View Herald TranscriptApr 26 2019, 12:19 AM