This is an archive of the discontinued LLVM Phabricator instance.


Differential D78845

[COFF] Add a fastpath for /INCLUDE: in .drective sections
Closed, Public

Authored by rnk on Apr 24 2020, 5:44 PM.


Details

Reviewers

aganea
hans
thakis

Commits

rG01b5f521408d: [COFF] Add a fastpath for /INCLUDE: in .drective sections

Summary

This speeds up linking chrome.dll with PGO instrumentation by 13%
(154271ms -> 134033ms).

LLVM's Option library is very slow. In particular, it allocates at least
one large-ish heap object (Arg) for every argument. When PGO
instrumentation is enabled, all the __profd_* symbols are added to the
@llvm.used list, which compiles down to these /INCLUDE: directives. This
means we have O(#symbols) directives to parse in the section, so we end
up allocating an Arg for every function symbol in the object file. This
is unnecessary.

To address the issue and speed up the link, extend the fast path that we
already have for /EXPORT:, which has similar scaling issues.
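
The fast-path idea can be sketched with the standard library alone. This is a simplified illustration rather than the committed code (which appears in the diff further down and uses LLVM's StringRef); the names `SplitDirectives`, `splitDirectives`, and `startsWithLower` are hypothetical. Tokens that carry a symbol name are peeled off as cheap string views, and only the remaining tokens would be handed to a general-purpose option parser.

```cpp
#include <cctype>
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical result type: symbol-carrying directives are kept as views,
// everything else is left for a general-purpose option parser.
struct SplitDirectives {
  std::vector<std::string_view> exports;
  std::vector<std::string_view> includes;
  std::vector<std::string_view> rest;
};

// Case-insensitive prefix test (ASCII only). `prefix` must be lowercase.
inline bool startsWithLower(std::string_view s, std::string_view prefix) {
  if (s.size() < prefix.size())
    return false;
  for (std::size_t i = 0; i < prefix.size(); ++i)
    if (std::tolower(static_cast<unsigned char>(s[i])) != prefix[i])
      return false;
  return true;
}

inline SplitDirectives
splitDirectives(const std::vector<std::string_view> &tokens) {
  constexpr std::string_view exportOpt = "export:";
  constexpr std::string_view includeOpt = "include:";
  SplitDirectives result;
  for (std::string_view tok : tokens) {
    // Options may start with either '/' or '-'.
    if (tok.empty() || (tok[0] != '/' && tok[0] != '-')) {
      result.rest.push_back(tok);
      continue;
    }
    std::string_view body = tok.substr(1);
    if (startsWithLower(body, exportOpt))
      result.exports.push_back(body.substr(exportOpt.size()));
    else if (startsWithLower(body, includeOpt))
      result.includes.push_back(body.substr(includeOpt.size()));
    else
      result.rest.push_back(tok); // Slow path: needs real option parsing.
  }
  return result;
}
```

With this split, the number of tokens that reach the slow parser no longer grows with the number of symbols in the object file.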

I promise that I took a hard look at optimizing the Option library, but
its data structures are very general and would need a lot of cleanup. We
have accumulated lots of optional features (option groups, aliases,
multiple values) over the years, and these are now properties of every
parsed argument, when the vast majority of arguments do not use these
features.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

rnk created this revision. Apr 24 2020, 5:44 PM

Herald added a project: Restricted Project. Apr 24 2020, 5:44 PM

Harbormaster failed remote builds in B54653: Diff 260041! Apr 24 2020, 6:58 PM

Nice!

This revision is now accepted and ready to land. Apr 24 2020, 7:12 PM

Re: "I promise that I took a hard look at optimizing the Option library, ... features." (on "multiple values"):

Sigh, the only user of multiple values: https://reviews.llvm.org/D62070#2000628

As for LLVMOptions, what prevents a BumpAllocator + placement new on the Arg(s)? Or is the perf. wasted somewhere else?

Side note: I was profiling the build-time regression between Clang 9, 10 & 11 on building LLVM and building a few of our games, with and without debug info. There's a severe regression, +10% CPU without debug info, and +15 to +18% with debug info, from Clang 9 to 10. Clang 11 adds an extra +2%. Additionally, no matter from what angle I look, allocations in clang take 10.5%-11% of the total CPU time (on Windows with the standard heap allocator). Replacing the allocator reduces that to 3.5% CPU for allocations, and improves some bandwidth/cache-sensitive functions along the way, which effectively reduces CPU usage by 15% (compared to baseline Clang 10). But at the same time it sweeps the issue under the carpet. This all seems related to the number of (small) allocations in LLVM generally.

lld/COFF/Driver.h
Line 51: Curious: how many includes do you have per .drective? Is it worth calling `.reserve()` somehow before inserting? MS-STL grows the `std::vector` buffer geometrically. If your .drective has many tokens, we would probably allocate and move memory several times per .drective. I think `includes.reserve(tokensNum)` and `exports.reserve(tokensNum)` are better than paying the cost of re-allocation, even if we waste a little extra memory. Unless you do a two-step parsing.
lld/COFF/DriverUtils.cpp
Line 869: `tokenize()` allocates, i.e. the returned `std::vector` doesn't `.reserve()` when constructed from a range (at least in MS STL 2019). That would be a good candidate for further optimization, especially since you don't need the 'saver'.
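
The effect of `.reserve()` raised in the inline comments can be observed by watching the vector's capacity change. This is a small standard-library sketch; the helper name `appendCountingReallocs` is made up for illustration and nothing here is taken from lld.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Appends every token to `out` and returns how many times the vector's
// capacity changed during the insert loop, i.e. how many times push_back
// had to reallocate. The optional up-front reserve() is not counted.
inline std::size_t
appendCountingReallocs(std::vector<std::string_view> &out,
                       const std::vector<std::string_view> &tokens,
                       bool reserveFirst) {
  if (reserveFirst)
    out.reserve(out.size() + tokens.size()); // Upper bound: one slot per token.
  std::size_t reallocs = 0;
  std::size_t cap = out.capacity();
  for (std::string_view tok : tokens) {
    out.push_back(tok);
    if (out.capacity() != cap) {
      ++reallocs;
      cap = out.capacity();
    }
  }
  return reallocs;
}
```

Reserving for the total token count over-allocates whenever only some tokens are /include: or /export: directives, which is the memory trade-off the comment mentions.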

In D78845#2003705, @aganea wrote:

As for LLVMOptions, what prevents a BumpAllocator + placement new on the Arg(s)? Or is the perf. wasted somewhere else?

I think that's a good first step. After that, I would focus on making the object leaner. It is huge.
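
For readers unfamiliar with the "BumpAllocator + placement new" pattern being discussed, here is a minimal standalone sketch. It is not LLVM's BumpPtrAllocator and `FakeArg` is not `opt::Arg`; both `BumpArena` and `FakeArg` are hypothetical stand-ins, and the sketch assumes objects are trivially destructible and no larger than one slab.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <new>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical stand-in for a parsed argument; not LLVM's opt::Arg.
struct FakeArg {
  FakeArg(unsigned id, const char *v) : optionID(id), value(v) {}
  unsigned optionID;
  const char *value;
};

// A minimal bump ("arena") allocator: objects are carved out of large slabs
// and all memory is released at once when the arena is destroyed.
class BumpArena {
public:
  explicit BumpArena(std::size_t slabSize = 4096) : slabSize(slabSize) {}

  // Constructs a T in arena memory with placement new. The arena never runs
  // destructors, so T must be trivially destructible.
  template <typename T, typename... Args> T *create(Args &&...args) {
    static_assert(std::is_trivially_destructible<T>::value,
                  "arena does not run destructors");
    static_assert(alignof(T) <= alignof(std::max_align_t),
                  "over-aligned types are not supported");
    void *mem = allocate(sizeof(T), alignof(T));
    return new (mem) T(std::forward<Args>(args)...);
  }

  std::size_t slabCount() const { return slabs.size(); }

private:
  void *allocate(std::size_t size, std::size_t align) {
    assert(size <= slabSize && "object larger than a slab");
    // Round the current offset up to the requested alignment.
    std::size_t offset = (used + align - 1) & ~(align - 1);
    if (slabs.empty() || offset + size > slabSize) {
      slabs.push_back(std::make_unique<unsigned char[]>(slabSize));
      offset = 0;
    }
    used = offset + size;
    return slabs.back().get() + offset;
  }

  std::size_t slabSize;
  std::size_t used = 0;
  std::vector<std::unique_ptr<unsigned char[]>> slabs;
};
```

Each `create` call is a pointer bump in the common case, so the per-argument cost becomes one heap allocation per slab instead of one per object.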

The Option class is two pointers large, and we pass it by value. It needs a pointer to the parent option table so that it can implement getAliasedOption and getOptionGroup (sp). I think we could trim out the parent pointer by assuming that the option Info structs are laid out in an array indexed by option ID. Subtract the current option's ID from its Info pointer to get back to the start of the array, and then add the ID of the option you want to reach.
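
The pointer arithmetic suggested here can be sketched as follows. `OptInfo` is a hypothetical, simplified record, not LLVM's `OptTable::Info`, and the sketch assumes (as the comment does) that the records sit in one array in which element i has ID i; slot 0 is used here as an "invalid" placeholder.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical, simplified option record; not LLVM's OptTable::Info.
struct OptInfo {
  unsigned id;      // Equal to this record's index in the table.
  unsigned aliasID; // Index of the aliased option, or 0 for "none".
  const char *name;
};

// Reaches another record of the same table from a record pointer alone.
// Precondition: `info` points into an array in which element i has id == i,
// and `otherID` is a valid index into that array.
inline const OptInfo *siblingByID(const OptInfo *info, unsigned otherID) {
  const OptInfo *tableBase = info - info->id; // Step back to element 0.
  return tableBase + otherID;
}

// Resolves an alias without storing a pointer to the owning table.
inline const OptInfo *aliasedOption(const OptInfo *info) {
  return info->aliasID ? siblingByID(info, info->aliasID) : nullptr;
}
```

The saving is one pointer per option handle, at the price of a layout invariant that every table must uphold.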

For Arg itself, I would suggest making all seldom-used fields members of TrailingObjects. This complicates the use of BumpPtrAllocator, but seems worth it.

Side note: I was profiling the build-time regression between Clang 9, 10 & 11 on building LLVM and building a few of our games, with and without debug info. There's a severe regression, +10% CPU without debug info, and +15 to +18% with debug info, from Clang 9 to 10. Clang 11 adds an extra +2%. Additionally, no matter from what angle I look, allocations in clang take 10.5%-11% of the total CPU time (on Windows with the standard heap allocator). Replacing the allocator reduces that to 3.5% CPU for allocations, and improves some bandwidth/cache-sensitive functions along the way, which effectively reduces CPU usage by 15% (compared to baseline Clang 10). But at the same time it sweeps the issue under the carpet. This all seems related to the number of (small) allocations in LLVM generally.

I think the Google C++ production toolchain team noticed similar results and switched to tcmalloc to achieve the same thing.

I think early in LLVM project history, developers did a lot of micro-optimization focusing on reducing heap allocations (see prevalence (and overuse!) of SmallVector), and a lot of that has gone by the wayside as generic containers proliferate in new code.

However, I know that for the option library specifically, performance was not a priority because it was considered to only impact startup time, and therefore not worth optimizing. Reusing it for .drective parsing where throughput is important takes it outside of the original problem domain.

As a next step, I noticed that cl::TokenizeWindowsCommandLine copies all strings. ;_; When I initially wrote it, I had intended that it would only copy an argument in the case that it had to deal with quotations, but it looks like some developer has "helpfully" fixed a use after free by making it always copy. :(

Closed by commit rG01b5f521408d: [COFF] Add a fastpath for /INCLUDE: in .drective sections (authored by rnk). Apr 28 2020, 10:45 AM

This revision was automatically updated to reflect the committed changes.

In D78845#2008164, @rnk wrote:

As a next step, I noticed that cl::TokenizeWindowsCommandLine copies all strings. ;_; When I initially wrote it, I had intended that it would only copy an argument in the case that it had to deal with quotations, but it looks like some developer has "helpfully" fixed a use after free by making it always copy. :(

Never mind, ignore that part of the comment. I just imagined that I had implemented this optimization. :) I wasn't able to implement this optimization because the Option library really wants to work with null-terminated strings, and we have to copy to get that null termination.

To avoid copying all strings during tokenization, we would need to start by relaxing that null terminated string requirement in the Option library. The string copies should already use BumpPtrAllocator, since they use StringSaver. This optimization should meaningfully reduce memory usage for PGO instrumented builds, because these strings currently live forever, and that is unnecessary.
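
The null-termination constraint can be illustrated with a standard-library sketch. Nothing here is LLVM code: `splitViews` is a deliberately naive whitespace tokenizer that ignores quoting, and `CStringStore` is a hypothetical stand-in for a string saver. Tokens produced as views cost no copy, but a view into the middle of a buffer is not null-terminated, so a consumer that wants C strings forces one copy per token.

```cpp
#include <cstddef>
#include <deque>
#include <string>
#include <string_view>
#include <vector>

// Splits on spaces and tabs without copying: each token is a view into `s`.
// Quoting and escaping are deliberately not handled in this sketch.
inline std::vector<std::string_view> splitViews(std::string_view s) {
  std::vector<std::string_view> out;
  std::size_t i = 0;
  while (i < s.size()) {
    while (i < s.size() && (s[i] == ' ' || s[i] == '\t'))
      ++i;
    std::size_t start = i;
    while (i < s.size() && s[i] != ' ' && s[i] != '\t')
      ++i;
    if (i > start)
      out.push_back(s.substr(start, i - start));
  }
  return out;
}

// Hypothetical stand-in for a string saver: copies a view so that the result
// is null-terminated and stays valid for as long as the store lives.
class CStringStore {
public:
  const char *save(std::string_view s) {
    storage.emplace_back(s);
    return storage.back().c_str();
  }

private:
  // std::deque keeps references to existing elements valid on emplace_back.
  std::deque<std::string> storage;
};
```

A parser that accepted length-delimited views directly would only need `splitViews`; the store exists purely to satisfy the C-string requirement.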

In D78845#2008164, @rnk wrote:

I think early in LLVM project history, developers did a lot of micro-optimization focusing on reducing heap allocations (see prevalence (and overuse!) of SmallVector), and a lot of that has gone by the wayside as generic containers proliferate in new code.

I think the main issue is that there's no easy way to "see" how the allocations scale across LLVM in general. How many of them, where they are done, how many per sec., diff. against a previous build, etc.
We could implement a system to "tag" each allocation and save it along with the callstack. It would then be easy to save binary snapshots and inspect where the allocations go, do diff-ing, etc.
It just takes someone that has time to do it ;-)
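
One way such tagging could look, reduced to a sketch: a standard-conforming allocator that records each allocation under a caller-chosen tag. This is an assumption about how the idea might be prototyped, not an existing LLVM facility; `allocRegistry`, `AllocStats`, and `TaggedAllocator` are hypothetical names, and the sketch records only counts and bytes, not callstacks.

```cpp
#include <cstddef>
#include <map>
#include <new>
#include <string>
#include <vector>

struct AllocStats {
  std::size_t count = 0; // Number of allocate() calls.
  std::size_t bytes = 0; // Total bytes requested.
};

// Hypothetical process-wide registry mapping a tag to its statistics.
inline std::map<std::string, AllocStats> &allocRegistry() {
  static std::map<std::string, AllocStats> registry;
  return registry;
}

// An allocator that forwards to the global heap and records every
// allocation under its tag. Callstack capture is omitted here.
template <typename T> struct TaggedAllocator {
  using value_type = T;
  const char *tag;

  explicit TaggedAllocator(const char *tag) : tag(tag) {}
  template <typename U>
  TaggedAllocator(const TaggedAllocator<U> &other) : tag(other.tag) {}

  T *allocate(std::size_t n) {
    AllocStats &stats = allocRegistry()[tag];
    ++stats.count;
    stats.bytes += n * sizeof(T);
    return static_cast<T *>(::operator new(n * sizeof(T)));
  }
  void deallocate(T *p, std::size_t) { ::operator delete(p); }
};

// All instances share the global heap, so any one can free another's memory.
template <typename T, typename U>
bool operator==(const TaggedAllocator<T> &, const TaggedAllocator<U> &) {
  return true;
}
template <typename T, typename U>
bool operator!=(const TaggedAllocator<T> &, const TaggedAllocator<U> &) {
  return false;
}
```

A container declared as `std::vector<int, TaggedAllocator<int>>` and constructed with `TaggedAllocator<int>("includes")` would then show up under the "includes" tag, and two snapshots of the registry could be diffed between builds.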

rnk mentioned this in D97585: [InstrProfiling] Use llvm.compiler.used instead of llvm.used for ELF. Mar 1 2021, 11:38 AM

thakis mentioned this in D113075: [lld-macho] Use separate tablegen file for LC_LINKER_OPTION. Nov 5 2021, 12:46 PM

MaskRay mentioned this in D130121: [3/3] [COFF] Emit embedded -exclude-symbols: directives for hidden visibility for MinGW. Aug 9 2022, 5:37 PM

Revision Contents

Path                        Size
lld/COFF/Driver.h           14 lines
lld/COFF/Driver.cpp         12 lines
lld/COFF/DriverUtils.cpp    20 lines

Diff 260698

lld/COFF/Driver.h

@@ (35 lines not shown) @@
 using llvm::COFF::WindowsSubsystem;
 using llvm::Optional;

 class COFFOptTable : public llvm::opt::OptTable {
 public:
   COFFOptTable();
 };

+// The result of parsing the .drective section. The /export: and /include:
+// options are handled separately because they reference symbols, and the number
+// of symbols can be quite large. The LLVM Option library will perform at least
+// one memory allocation per argument, and that is prohibitively slow for
+// parsing directives.
+struct ParsedDirectives {
+  std::vector<StringRef> exports;
+  std::vector<StringRef> includes;
+  llvm::opt::InputArgList args;
+};
+
 class ArgParser {
 public:
   // Parses command line options.
   llvm::opt::InputArgList parse(llvm::ArrayRef<const char *> args);

   // Tokenizes a given string and then parses as command line options.
   llvm::opt::InputArgList parse(StringRef s) { return parse(tokenize(s)); }

   // Tokenizes a given string and then parses as command line options in
   // .drectve section. /EXPORT options are returned in second element
   // to be processed in fastpath.
-  std::pair<llvm::opt::InputArgList, std::vector<StringRef>>
-  parseDirectives(StringRef s);
+  ParsedDirectives parseDirectives(StringRef s);

 private:
   // Concatenate LINK environment variable.
   void addLINK(SmallVector<const char *, 256> &argv);

   std::vector<const char *> tokenize(StringRef s);

   COFFOptTable table;
@@ (144 lines not shown) @@

lld/COFF/Driver.cpp

@@ (337 lines not shown) @@ void LinkerDriver::parseDirectives(InputFile *file) {
   if (s.empty())
     return;

   log("Directives: " + toString(file) + ": " + s);

   ArgParser parser;
   // .drectve is always tokenized using Windows shell rules.
   // /EXPORT: option can appear too many times, processing in fastpath.
-  opt::InputArgList args;
-  std::vector<StringRef> exports;
-  std::tie(args, exports) = parser.parseDirectives(s);
+  ParsedDirectives directives = parser.parseDirectives(s);

-  for (StringRef e : exports) {
+  for (StringRef e : directives.exports) {
     // If a common header file contains dllexported function
     // declarations, many object files may end up with having the
     // same /EXPORT options. In order to save cost of parsing them,
     // we dedup them first.
     if (!directivesExports.insert(e).second)
       continue;

     Export exp = parseExport(e);
     if (config->machine == I386 && config->mingw) {
       if (!isDecorated(exp.name))
         exp.name = saver.save("_" + exp.name);
       if (!exp.extName.empty() && !isDecorated(exp.extName))
         exp.extName = saver.save("_" + exp.extName);
     }
     exp.directives = true;
     config->exports.push_back(exp);
   }

-  for (auto *arg : args) {
+  // Handle /include: in bulk.
+  for (StringRef inc : directives.includes)
+    addUndefined(inc);
+
+  for (auto *arg : directives.args) {
     switch (arg->getOption().getID()) {
     case OPT_aligncomm:
       parseAligncomm(arg->getValue());
       break;
     case OPT_alternatename:
       parseAlternateName(arg->getValue());
       break;
     case OPT_defaultlib:
@@ (1,662 lines not shown) @@

lld/COFF/DriverUtils.cpp

@@ (856 lines not shown) @@ opt::InputArgList ArgParser::parse(ArrayRef<const char *> argv) {
   if (args.hasArg(OPT_lib))
     warn("ignoring /lib since it's not the first argument");

   return args;
 }

 // Tokenizes and parses a given string as command line in .drective section.
 // /EXPORT options are processed in fastpath.
-std::pair<opt::InputArgList, std::vector<StringRef>>
-ArgParser::parseDirectives(StringRef s) {
-  std::vector<StringRef> exports;
+ParsedDirectives ArgParser::parseDirectives(StringRef s) {
+  ParsedDirectives result;
   SmallVector<const char *, 16> rest;

   for (StringRef tok : tokenize(s)) {
     if (tok.startswith_lower("/export:") || tok.startswith_lower("-export:"))
-      exports.push_back(tok.substr(strlen("/export:")));
+      result.exports.push_back(tok.substr(strlen("/export:")));
+    else if (tok.startswith_lower("/include:") ||
+             tok.startswith_lower("-include:"))
+      result.includes.push_back(tok.substr(strlen("/include:")));
     else
       rest.push_back(tok.data());
   }

   // Make InputArgList from unparsed string vectors.
   unsigned missingIndex;
   unsigned missingCount;

-  opt::InputArgList args = table.ParseArgs(rest, missingIndex, missingCount);
+  result.args = table.ParseArgs(rest, missingIndex, missingCount);

   if (missingCount)
-    fatal(Twine(args.getArgString(missingIndex)) + ": missing argument");
-  for (auto *arg : args.filtered(OPT_UNKNOWN))
-    warn("ignoring unknown argument: " + arg->getAsString(args));
-  return {std::move(args), std::move(exports)};
+    fatal(Twine(result.args.getArgString(missingIndex)) + ": missing argument");
+  for (auto *arg : result.args.filtered(OPT_UNKNOWN))
+    warn("ignoring unknown argument: " + arg->getAsString(result.args));
+  return result;
 }

 // link.exe has an interesting feature. If LINK or _LINK_ environment
 // variables exist, their contents are handled as command line strings.
 // So you can pass extra arguments using them.
 void ArgParser::addLINK(SmallVector<const char *, 256> &argv) {
   // Concatenate LINK env and command line arguments, and then parse them.
   if (Optional<std::string> s = Process::GetEnv("LINK")) {
@@ (23 lines not shown) @@