This is an archive of the discontinued LLVM Phabricator instance.

[clangd] Improve clangd-indexer performance
ClosedPublic

Authored by ArcsinX on Nov 9 2020, 12:15 AM.

Details

Summary

This is a try to improve clangd-indexer tool performance:

  • avoid processing already processed files.
  • use different mutexes for different entities (e.g. do not block insertion of references while symbols are inserted)

Results for LLVM project indexing:

  • before: ~30 minutes
  • after: ~10 minutes

Diff Detail

Event Timeline

ArcsinX created this revision.Nov 9 2020, 12:15 AM
ArcsinX requested review of this revision.Nov 9 2020, 12:15 AM
kadircet added inline comments.Nov 9 2020, 12:38 AM
clang-tools-extra/clangd/indexer/IndexerMain.cpp
46

this changes the behavior of clangd-indexer slightly. it feels like an okay-ish trade-off considering the 3x speed up, but might have dire consequences.

Without this change, clangd-indexer can index headers in multiple configurations, e.g. if you have src1.cc that includes a.h with a -DFOO and src2.cc that includes a.h with -DBAR, a.h might end up producing different symbols. All of them are being indexed at the moment. After this change, the first one to index will win.

This is also what we do with background-index but we have different requirements for this behavior, as we store only a single shard per source file, even if we indexed sources in different configurations only one of them would prevail (unless we postpone shard writing to end of indexing and accumulate results in memory).

Also we are planning to make use of this binary in a server-like setup, and use the produced index to serve multiple clients. In some situations keeping symbols from multiple configurations might be really useful, but I am not so sure about it as they'll still be collapsed if USR generation produces same IDs for those symbols (and I think it will). So I am leaning towards making this change, but I would like to hear what others think too.

hokein added inline comments.Nov 9 2020, 4:05 AM
clang-tools-extra/clangd/indexer/IndexerMain.cpp
46

+1, this looks like a good trade-off to me.

This seems like an accuracy/latency tradeoff in an environment where... it's not clear why we care about latency very much. Do we?

OTOH, how useful is it to have a static index that's more accurate if it's going to be shadowed by a dynamic index for the files you care most about.

ArcsinX added a comment.EditedNov 9 2020, 4:57 AM

Thank you all for your comments.

I will try to describe my thoughts:

  • the more data SymbolCollector will collect, the greater the difference (N will increase in this patch improves performance in N times). E.g. collection call/contain relations will affect clangd-indexer performance significantly while clangd background indexer will not have such huge affection.
  • performance improvement will help to update the index file more often (for remote server scenario). I mean an index file will be up to date every K commits instead of every 3 x K commits (assume constant time difference between commits)
  • the behavior with this patch is basically the same as clangd background indexer behavior.

If this patch could not be approved as is, then maybe we could add command line option to switch old/new behavior. What do you think? (Сan't think of a suitable name for this option =))

kadircet accepted this revision.Nov 9 2020, 7:28 AM

Looks like everyone thinks that this sounds reasonable. So LGTM. Thanks for the patch!

This revision is now accepted and ready to land.Nov 9 2020, 7:28 AM
ArcsinX updated this revision to Diff 304262.Nov 10 2020, 11:00 AM

Don't check that AbsPath is not in Files twice.

This revision was automatically updated to reflect the committed changes.