This is an archive of the discontinued LLVM Phabricator instance.

Differential D91051

[clangd] Improve clangd-indexer performance
Closed, Public

Authored by ArcsinX on Nov 9 2020, 12:15 AM.

Details

Reviewers

sammccall
kadircet
hokein

Commits

rGdad804a193ed: [clangd] Improve clangd-indexer performance

Summary

This is an attempt to improve clangd-indexer performance:

- Avoid processing files that have already been processed.
- Use a separate mutex for each kind of entity (e.g. do not block insertion of references while symbols are being inserted).

Results for LLVM project indexing:

before: ~30 minutes
after: ~10 minutes
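
The second point of the summary (one mutex per kind of entity) can be sketched outside of clangd as follows. This is a minimal illustration only: `Collected`, `addSymbol`, `addRef` and `fillConcurrently` are hypothetical names, and plain vectors of strings stand in for the real slab builders.

```cpp
#include <cstddef>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for the factory's shared state; not the clangd types.
class Collected {
public:
  void addSymbol(std::string S) {
    std::lock_guard<std::mutex> Lock(SymbolsMu);
    Symbols.push_back(std::move(S));
  }
  void addRef(std::string R) {
    // Takes RefsMu only, so it never waits for a symbol insertion.
    std::lock_guard<std::mutex> Lock(RefsMu);
    Refs.push_back(std::move(R));
  }
  std::size_t numSymbols() {
    std::lock_guard<std::mutex> Lock(SymbolsMu);
    return Symbols.size();
  }
  std::size_t numRefs() {
    std::lock_guard<std::mutex> Lock(RefsMu);
    return Refs.size();
  }

private:
  std::mutex SymbolsMu; // guards Symbols only
  std::vector<std::string> Symbols;
  std::mutex RefsMu; // guards Refs only
  std::vector<std::string> Refs;
};

// Runs one symbol writer and one reference writer at the same time.
// With a single shared mutex the two loops would serialize each other.
void fillConcurrently(Collected &C, std::size_t N) {
  std::thread SymWriter([&C, N] {
    for (std::size_t I = 0; I < N; ++I)
      C.addSymbol("sym" + std::to_string(I));
  });
  std::thread RefWriter([&C, N] {
    for (std::size_t I = 0; I < N; ++I)
      C.addRef("ref" + std::to_string(I));
  });
  SymWriter.join();
  RefWriter.join();
}
```

Each container is only ever touched under its own mutex, so correctness does not depend on the two locks being the same; sharing one lock only adds contention.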

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

ArcsinX created this revision. Nov 9 2020, 12:15 AM

Herald added a project: Restricted Project. Nov 9 2020, 12:15 AM

Herald added subscribers: cfe-commits, usaxena95, kadircet, arphaman.

ArcsinX requested review of this revision. Nov 9 2020, 12:15 AM

Herald added subscribers: MaskRay, ilya-biryukov. Nov 9 2020, 12:15 AM

ArcsinX added reviewers: sammccall, kadircet, hokein. Nov 9 2020, 12:16 AM

kadircet added inline comments. Nov 9 2020, 12:38 AM

clang-tools-extra/clangd/indexer/IndexerMain.cpp
46	This changes the behavior of clangd-indexer slightly. It feels like an okay-ish trade-off considering the 3x speed-up, but it might have dire consequences. Without this change, clangd-indexer can index headers in multiple configurations: e.g. if src1.cc includes a.h with `-DFOO` and src2.cc includes a.h with `-DBAR`, a.h might end up producing different symbols, and all of them are indexed at the moment. After this change, the first one to be indexed wins. This is also what we do with background-index, but there we have different requirements: since we store only a single shard per source file, even if we indexed sources in different configurations only one of them would prevail (unless we postponed shard writing to the end of indexing and accumulated results in memory). Also, we are planning to use this binary in a server-like setup and use the produced index to serve multiple clients. In some situations keeping symbols from multiple configurations might be really useful, but I am not so sure about it, as they'll still be collapsed if USR generation produces the same IDs for those symbols (and I think it will). So I am leaning towards making this change, but I would like to hear what others think too.
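
To make the `-DFOO` / `-DBAR` scenario above concrete, here is a small self-contained model of it. It does not use any clangd API: `TranslationUnit`, `symbolsInHeader` and `indexHeader` are hypothetical names, and a.h is assumed to declare `foo` only under `FOO` and `bar` only under `BAR`.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical model of one source file that includes a.h with a single -D flag.
struct TranslationUnit {
  std::string Source;
  std::string Define; // "FOO" or "BAR"
};

// Models a.h as if it contained:
//   #ifdef FOO
//   void foo();
//   #endif
//   #ifdef BAR
//   void bar();
//   #endif
std::vector<std::string> symbolsInHeader(const std::string &Define) {
  if (Define == "FOO")
    return {"foo"};
  if (Define == "BAR")
    return {"bar"};
  return {};
}

// Collects the symbols that a.h contributes to the index.
// With SkipProcessedFiles set, only the first translation unit that reaches
// a.h gets to index it; later ones skip the header entirely.
std::set<std::string> indexHeader(const std::vector<TranslationUnit> &TUs,
                                  bool SkipProcessedFiles) {
  std::set<std::string> Symbols;
  std::set<std::string> Seen;
  for (const auto &TU : TUs) {
    if (SkipProcessedFiles && !Seen.insert("a.h").second)
      continue; // a.h was already processed in another configuration
    for (const auto &S : symbolsInHeader(TU.Define))
      Symbols.insert(S);
  }
  return Symbols;
}
```

For the inputs `{"src1.cc", "FOO"}` and `{"src2.cc", "BAR"}`, the model yields both `foo` and `bar` without skipping, and only `foo` with skipping. In a real parallel run, which configuration comes first is not fixed, which is the "first one wins" behavior described in the comment.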

Harbormaster completed remote builds in B78066: Diff 303761. Nov 9 2020, 12:58 AM

hokein added inline comments. Nov 9 2020, 4:05 AM

clang-tools-extra/clangd/indexer/IndexerMain.cpp
46	+1, this looks like a good trade-off to me.

This seems like an accuracy/latency tradeoff in an environment where... it's not clear why we care about latency very much. Do we?

OTOH, how useful is it to have a static index that's more accurate if it's going to be shadowed by a dynamic index for the files you care most about?

Thank you all for your comments.

I will try to describe my thoughts:

- The more data SymbolCollector collects, the greater the difference becomes: the N in "this patch improves performance N times" will grow. E.g. collecting call/contain relations will affect clangd-indexer performance significantly, while the clangd background indexer will not be affected nearly as much.
- The performance improvement will help to update the index file more often (for the remote server scenario). I mean the index file will be up to date every K commits instead of every 3 x K commits (assuming a constant time difference between commits).
- The behavior with this patch is basically the same as the clangd background indexer's behavior.

If this patch cannot be approved as is, then maybe we could add a command-line option to switch between the old and new behavior. What do you think? (Can't think of a suitable name for this option =))

Looks like everyone thinks that this sounds reasonable. So LGTM. Thanks for the patch!

This revision is now accepted and ready to land. Nov 9 2020, 7:28 AM

Don't check that AbsPath is not in Files twice.
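
The update note above does not show the earlier revision, so the following is only a guess at the pattern being removed: a separate membership test followed by an insert, versus relying on the result of `insert` alone. `std::unordered_set` is used here in place of `llvm::StringSet<>` to keep the sketch self-contained, and both function names are hypothetical.

```cpp
#include <string>
#include <unordered_set>

// Two lookups: one for the membership test, one for the insertion.
bool markProcessedTwoLookups(std::unordered_set<std::string> &Files,
                             const std::string &Path) {
  if (Files.count(Path))
    return false; // already processed
  Files.insert(Path);
  return true;
}

// One lookup: insert() reports through .second whether the element was new.
bool markProcessed(std::unordered_set<std::string> &Files,
                   const std::string &Path) {
  return Files.insert(Path).second;
}
```

Both functions return true the first time a path is seen and false afterwards. The single-call form also keeps the check and the insertion in one step, which is convenient when the set is guarded by a mutex as in the patch.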

Harbormaster completed remote builds in B78340: Diff 304262. Nov 10 2020, 11:50 AM

Closed by commit rGdad804a193ed: [clangd] Improve clangd-indexer performance (authored by ArcsinX). Nov 11 2020, 3:39 AM

This revision was automatically updated to reflect the committed changes.

ArcsinX added a commit: rGdad804a193ed: [clangd] Improve clangd-indexer performance.

Revision Contents

Path: clang-tools-extra/clangd/indexer/IndexerMain.cpp
Size: 18 lines

Diff 304454

clang-tools-extra/clangd/indexer/IndexerMain.cpp

(37 lines not shown)

 class IndexActionFactory : public tooling::FrontendActionFactory {
 public:
   IndexActionFactory(IndexFileIn &Result) : Result(Result) {}

   std::unique_ptr<FrontendAction> create() override {
     SymbolCollector::Options Opts;
     Opts.CountReferences = true;
+    Opts.FileFilter = [&](const SourceManager &SM, FileID FID) {
+      const auto *F = SM.getFileEntryForID(FID);
+      if (!F)
+        return false; // Skip invalid files.
+      auto AbsPath = getCanonicalPath(F, SM);
+      if (!AbsPath)
+        return false; // Skip files without absolute path.
+      std::lock_guard<std::mutex> Lock(FilesMu);
+      return Files.insert(*AbsPath).second; // Skip already processed files.
+    };
     return createStaticIndexingAction(
         Opts,
         [&](SymbolSlab S) {
           // Merge as we go.
           std::lock_guard<std::mutex> Lock(SymbolsMu);
           for (const auto &Sym : S) {
             if (const auto *Existing = Symbols.find(Sym.ID))
               Symbols.insert(mergeSymbol(*Existing, Sym));
             else
               Symbols.insert(Sym);
           }
         },
         [&](RefSlab S) {
-          std::lock_guard<std::mutex> Lock(SymbolsMu);
+          std::lock_guard<std::mutex> Lock(RefsMu);
           for (const auto &Sym : S) {
             // Deduplication happens during insertion.
             for (const auto &Ref : Sym.second)
               Refs.insert(Sym.first, Ref);
           }
         },
         [&](RelationSlab S) {
-          std::lock_guard<std::mutex> Lock(SymbolsMu);
+          std::lock_guard<std::mutex> Lock(RelsMu);
           for (const auto &R : S) {
             Relations.insert(R);
           }
         },
         /*IncludeGraphCallback=*/nullptr);
   }

   // Awkward: we write the result in the destructor, because the executor
   // takes ownership so it's the easiest way to get our data back out.
   ~IndexActionFactory() {
     Result.Symbols = std::move(Symbols).build();
     Result.Refs = std::move(Refs).build();
     Result.Relations = std::move(Relations).build();
   }

 private:
   IndexFileIn &Result;
+  std::mutex FilesMu;
+  llvm::StringSet<> Files;
   std::mutex SymbolsMu;
   SymbolSlab::Builder Symbols;
+  std::mutex RefsMu;
   RefSlab::Builder Refs;
+  std::mutex RelsMu;
   RelationSlab::Builder Relations;
 };

 } // namespace
 } // namespace clangd
 } // namespace clang

 int main(int argc, const char **argv) {
(39 lines not shown)