This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
lld/COFF/
-
COFF/
-
Driver.cpp

Differential D105506

[LLD] [COFF] Avoid thread exhaustion on 32-bit Windows host
ClosedPublic

Authored by jeremyd2019 on Jul 6 2021, 12:28 PM.

Download Raw Diff

Details

Reviewers

mstorsjo
rnk
pcc

Commits

rG7a7da69fbe28: [LLD] [COFF] Avoid thread exhaustion on 32-bit Windows host

Summary

LLD on 32-bit Windows would frequently fail on large projects with an exception "thread constructor failed: Exec format error". The stack trace pointed to this usage of std::async, and looking at the implementation in libc++ it seems using std::async with std::launch::async results in the immediate creation of a new thread for every call. This could result in a potentially unbounded number of threads, depending on the number of input files. This seems to be hitting some limit in 32-bit Windows host.

I took the easy route, and only use threads on 64-bit Windows, not all Windows as before. I was thinking a more proper solution might involve using a thread pool rather than blindly spawning any number of new threads, but that may have other unforeseen consequences.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

jeremyd2019 requested review of this revision.Jul 6 2021, 12:28 PM

jeremyd2019 created this revision.

Herald added a project: Restricted Project. · View Herald TranscriptJul 6 2021, 12:28 PM

I do not have commit rights, and in fact this is my first attempt to submit a patch to llvm.

Adding @pcc who I think designed this performance optimization back in the day, in case he's still active and interested in commenting on it.

I think this might be a reasonably pragmatic fix for the issue. For the case we care most about, 64 bit, I don't think we would want to use a threadpool - using potentially unbounded number of threads might be pretty much intentional, to get as much parallelism as possible while waiting on IO. But for more limited cases like 32 bit mode, using a thread pool might be a reasonable approximation. This probably would be something of a performance degradation for 32 bit cases, while making it more robust - and spending more code on optimizing that case might not be worth it if it results in lots of code churn. But that depends on the scale of the degradation of course.

@jeremyd2019 - You wouldn't happen to have a reasonbly sized testcase to link lying around that you could try to run and see if the performance difference is measurable (i.e. a built project with e.g. lots of loose object files)? (Running lld often is so fast so it's hard to even measure anything unless you're linking something on the scale of Chromium, but I think this optimization is kinda vital.)

In D105506#2860558, @mstorsjo wrote:

@jeremyd2019 - You wouldn't happen to have a reasonbly sized testcase to link lying around that you could try to run and see if the performance difference is measurable (i.e. a built project with e.g. lots of loose object files)? (Running lld often is so fast so it's hard to even measure anything unless you're linking something on the scale of Chromium, but I think this optimization is kinda vital.)

I'm not quite sure what you are getting at. I know of several projects that crash before with 32-bit LLD but succeed after. Generally when being built with higher -j options.

In D105506#2860570, @jeremyd2019 wrote:

In D105506#2860558, @mstorsjo wrote:

@jeremyd2019 - You wouldn't happen to have a reasonbly sized testcase to link lying around that you could try to run and see if the performance difference is measurable (i.e. a built project with e.g. lots of loose object files)? (Running lld often is so fast so it's hard to even measure anything unless you're linking something on the scale of Chromium, but I think this optimization is kinda vital.)

I'm not quite sure what you are getting at. I know of several projects that crash before with 32-bit LLD but succeed after. Generally when being built with higher -j options.

Not a testcase for whether this crashes or not, but to see how much the suggested fix regresses linking performance for such projects that do succeed even before this patch.

In D105506#2860576, @mstorsjo wrote:

Not a testcase for whether this crashes or not, but to see how much the suggested fix regresses linking performance for such projects that do succeed even before this patch.

Hmm, I guess I could build qt6-base with the non-patched lld, let it fail, then manually run and time the link (which should succeed in isolation because it only tends to crash with parallel builds), and manually run and time the link with the patched lld.

In D105506#2860589, @jeremyd2019 wrote:

In D105506#2860576, @mstorsjo wrote:

Not a testcase for whether this crashes or not, but to see how much the suggested fix regresses linking performance for such projects that do succeed even before this patch.

Hmm, I guess I could build qt6-base with the non-patched lld, let it fail, then manually run and time the link (which should succeed in isolation because it only tends to crash with parallel builds), and manually run and time the link with the patched lld.

Thanks, that'd be awesome. Yeah a project on that scale hopefully is big enough where the difference is noticable.

FWIW, see https://github.com/llvm/llvm-project/commit/6ee0b4e9f543fa108b47d3ae7c030d835cacd2c1 for the original commit that added this feature, giving the scale of the speedup that it gave. So it might make the linking time around 30-100% longer than before, roughly, on large projects. Given how fast lld is otherwise I think that might be tolerable (and lld on 32 bit can't link huge projects anyway due to using memory mapping).

Harbormaster completed remote builds in B112671: Diff 356799.Jul 6 2021, 1:07 PM

I couldn't tell anything useful from that. Both patched and unpatched lld linked QT6Core.dll (which crashed LLD in a parallel build) in about 0.3 seconds.

In D105506#2860673, @jeremyd2019 wrote:

I couldn't tell anything useful from that. Both patched and unpatched lld linked QT6Core.dll (which crashed LLD in a parallel build) in about 0.3 seconds.

Ok, thanks for testing anyway. Based on the numbers from the old commit from @pcc, it'd be a worst case slowdown of 2x, but most actual cases you can link on 32 bit are so small anyway that you probably run into the memory mapping limit fairly soon anyway, so the cases that can be linked are small enough that it's plenty fast anyway.

So with that I think this is fine, but I'd still like an ack from @rnk.

Yes, I suspect that anything with enough files for this to make a noticeable performance difference would have crashed on 32-bit Windows either because of this or the 32-bit address space being exhausted.

32-bit builds of LLD are not well supported (see aforementioned address space issues), so sure, we can power this optimization down there.

This revision is now accepted and ready to land.Jul 7 2021, 7:27 AM

Great! @jeremyd2019, let me know your preferred git author line (Real Name <email@address>) to use for the commit when I push it?

In D105506#2861900, @mstorsjo wrote:

Great! @jeremyd2019, let me know your preferred git author line (Real Name <email@address>) to use for the commit when I push it?

I suppose "Jeremy Drake <github (at) jdrake.com>"

Closed by commit rG7a7da69fbe28: [LLD] [COFF] Avoid thread exhaustion on 32-bit Windows host (authored by jeremyd2019, committed by mstorsjo). · Explain WhyJul 7 2021, 12:01 PM

This revision was automatically updated to reflect the committed changes.

mstorsjo added a commit: rG7a7da69fbe28: [LLD] [COFF] Avoid thread exhaustion on 32-bit Windows host.

Revision Contents

Path

Size

lld/

COFF/

Driver.cpp

5 lines

Diff 357034

lld/COFF/Driver.cpp

	Show First 20 Lines • Show All 143 Lines • ▼ Show 20 Lines
	// FIXME: We could open the file in createFutureForFile and avoid needing to			// FIXME: We could open the file in createFutureForFile and avoid needing to
	// return an error here, but for the moment that would cost us a file descriptor			// return an error here, but for the moment that would cost us a file descriptor
	// (a limited resource on Windows) for the duration that the future is pending.			// (a limited resource on Windows) for the duration that the future is pending.
	using MBErrPair = std::pair<std::unique_ptr<MemoryBuffer>, std::error_code>;			using MBErrPair = std::pair<std::unique_ptr<MemoryBuffer>, std::error_code>;

	// Create a std::future that opens and maps a file using the best strategy for			// Create a std::future that opens and maps a file using the best strategy for
	// the host platform.			// the host platform.
	static std::future<MBErrPair> createFutureForFile(std::string path) {			static std::future<MBErrPair> createFutureForFile(std::string path) {
	#if _WIN32			#if _WIN64
	// On Windows, file I/O is relatively slow so it is best to do this			// On Windows, file I/O is relatively slow so it is best to do this
	// asynchronously.			// asynchronously. But 32-bit has issues with potentially launching tons
				// of threads
	auto strategy = std::launch::async;			auto strategy = std::launch::async;
	#else			#else
	auto strategy = std::launch::deferred;			auto strategy = std::launch::deferred;
	#endif			#endif
	return std::async(strategy, [=]() {			return std::async(strategy, [=]() {
	auto mbOrErr = MemoryBuffer::getFile(path, /IsText=/false,			auto mbOrErr = MemoryBuffer::getFile(path, /IsText=/false,
	/RequiresNullTerminator=/false);			/RequiresNullTerminator=/false);
	if (!mbOrErr)			if (!mbOrErr)
	▲ Show 20 Lines • Show All 2,134 Lines • Show Last 20 Lines