This is a try to improve clangd-indexer tool performance:
- avoid processing already processed files.
- use different mutexes for different entities (e.g. do not block insertion of references while symbols are inserted)
Results for LLVM project indexing:
- before: ~30 minutes
- after: ~10 minutes
this changes the behavior of clangd-indexer slightly. it feels like an okay-ish trade-off considering the 3x speed up, but might have dire consequences.
Without this change, clangd-indexer can index headers in multiple configurations, e.g. if you have src1.cc that includes a.h with a -DFOO and src2.cc that includes a.h with -DBAR, a.h might end up producing different symbols. All of them are being indexed at the moment. After this change, the first one to index will win.
This is also what we do with background-index but we have different requirements for this behavior, as we store only a single shard per source file, even if we indexed sources in different configurations only one of them would prevail (unless we postpone shard writing to end of indexing and accumulate results in memory).
Also we are planning to make use of this binary in a server-like setup, and use the produced index to serve multiple clients. In some situations keeping symbols from multiple configurations might be really useful, but I am not so sure about it as they'll still be collapsed if USR generation produces same IDs for those symbols (and I think it will). So I am leaning towards making this change, but I would like to hear what others think too.