This is an archive of the discontinued LLVM Phabricator instance.

[clangd] Replace StringRef in SymbolLocation with a char pointer.
ClosedPublic

Authored by hokein on Oct 19 2018, 3:05 AM.

Details

Summary

This would save us 8 bytes per ref, and buy us ~50MB in total for llvm
index (from ~350MB to ~300 MB).

The char pointer must be null-terminated, and llvm::StringSaver
guarantees it.

Event Timeline

hokein created this revision.Oct 19 2018, 3:05 AM

Nice savings!

My initial thought is that it seems weird to only do it for this one string - the readers look different, the serialization code has different logic etc for the different flavors of strings.
The struct certainly looks nicer to my eyes.

This may be the right tradeoff, but I'd like to play with/think about other options for API here. One is to define a class like NullTerminatedStringRef (with a better name) that wraps a char* only, but otherwise behaves as much like stringref as we can manage. Happy for you to try some of these out or to try them myself.

I think this is worth a bit of thought before committing because as shown in this patch, these APIs have quite a lot of uses!

hokein updated this revision to Diff 172949.Nov 7 2018, 6:56 AM

Move forward the patch based on the offline discussion.

MaskRay added inline comments.Nov 7 2018, 10:37 AM
clangd/index/Index.h
84–86

The StringRef(const char*) constructor calls strlen. Do you believe this does not have noticeable performance issue?

(A minor point: tuple::operator< calls operator< of its components in a not very efficient way. The compiler does not know compare(a, b) < 0 implies compare(b, a) > 0 and strcmp may end up being called twice.)

hokein updated this revision to Diff 173842.Nov 13 2018, 6:54 AM
hokein marked an inline comment as done.

update the patch.

hokein added inline comments.Nov 13 2018, 6:55 AM
clangd/index/Index.h
84–86

Thanks for the suggestions. The code here has an addition strlen cost. I updated the patch to avoid this additional cost and the potential double < calls of string.

sammccall accepted this revision.Nov 13 2018, 7:18 AM
sammccall added inline comments.
clangd/index/Index.h
85

nit: I think it's more idiomatic just to inline as

return !strcmp(L.FileURI, R.FileURI) && ...
93

tie(strcmp(L.FileURI, R.FileURI), ...) < tie(0, ...)?

306
assert (!S.data()[S.size()] && "Visited StringRef must be null-terminated")
clangd/index/Merge.cpp
118

Since this reads awkwardly anyway...

bool(*L.Definition.FileURI) == bool(*R.definition.FileURI)

is shorter and faster

This revision is now accepted and ready to land.Nov 13 2018, 7:18 AM
hokein updated this revision to Diff 173856.Nov 13 2018, 7:42 AM
hokein marked 3 inline comments as done.

Address review comments.

clangd/index/Index.h
93

tie doesn't support this usage, it expects lvalue arguments.

306

Does this assert actually work? The StringRef is constructed from the char * (which using strlen). This assert will not get triggered I think, S.data()[S.size()] is always \0.

clangd/index/Merge.cpp
118

this is neat!

sammccall added inline comments.Nov 13 2018, 7:56 AM
clangd/index/Index.h
306

The point is the callback takes a mutable reference, and can replace the stringref.
We're asserting that the *new* stringref is nullterminated (so we can convert it safely to a char*)

hokein updated this revision to Diff 173861.Nov 13 2018, 8:27 AM
hokein marked an inline comment as done.

Add assert.

hokein added inline comments.Nov 13 2018, 8:27 AM
clangd/index/Index.h
306

Ah, that makes sense, I missed the fact that the CB mutates S.

This revision was automatically updated to reflect the committed changes.
alexfh added a subscriber: alexfh.Nov 14 2018, 5:31 AM
alexfh added inline comments.
clang-tools-extra/trunk/clangd/index/Index.h
71 ↗(On Diff #174013)

If the users of the SymbolLocation have a way to get a common "context", you could keep a list of filenames in this context and just store an index into this list (which could be even shorter). If the structure is used where no context is available, you could still keep a list of StringRefs somewhere (probably close to where the string data pointed by the StringRefs lives) and keep a StringRef * here. At a cost of one indirection you'd avoid calling strlen.

Not sure what trade-off is better for this particular use case though.

hokein added inline comments.Nov 14 2018, 5:59 AM
clang-tools-extra/trunk/clangd/index/Index.h
71 ↗(On Diff #174013)

SymbolLocation is a public interface, and the underlying data is stored in arena (which is not visible to clients), so clients don't know any Context except SymbolLocation.

Using StringRef * is an interesting idea -- keeping a list of StringRefs needs extra space, the number should be relatively small (only the number of different files), so we will spend extra <num of files> * sizeof(StringRef) bytes (for llvm, ~128 KB memory) , compared with the raw pointer solution.

alexfh added inline comments.Nov 14 2018, 6:09 AM
clang-tools-extra/trunk/clangd/index/Index.h
71 ↗(On Diff #174013)

Another option mentioned by Ilya offline is similar to MS's BSTR: keep the length along with the string data in the arena. Something like:

class InlineString {
  size_t Length;
  char Data[0];  // Open-ended array with a null-terminated string in it.
 public:
   operator StringRef() const { return StringRef(&Data[0], Length); }
   ...
};

And then keep const InlineString * here.

sammccall added inline comments.Nov 14 2018, 6:38 AM
clang-tools-extra/trunk/clangd/index/Index.h
71 ↗(On Diff #174013)

Yeah, operator StringRef() doesn't really work well unfortunately (I tried a StringRefz which just wrapped a char pointer, in a bid to avoid accidentally using pointer semantics).

Conversion rules are too complicated and existing APIs are already maximally constrained, so you can't satisfy all of {functions that take StringRef only, functions that take Twine only, functions that overload for StringRef/char*/std::string, functions that *also* overload for Twine, ...}