This reduces linking time by ~8% for my project (1.19s -> 0.53s for writeSections()). writeTo is const, which bodes well for it being parallelizable, and I've looked through the different overridden versions and can't see any race conditions. It produces the same byte-for-byte output for my project. I can do another look through the functions if people are in favor of this patch.
Details
- Reviewers
int3 thakis oontvoo - Group Reviewers
Restricted Project - Commits
- rG44978a234b8e: [lld/mac] Write output sections in parallel
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
Nice. Do you have some perf numbers + the size of the workload you're benchmarking against?
Would still love to have numbers on the workload size if possible.
I've looked through the different overridden versions and can't see any race conditions.
Yes, writeTo should be synchronization-free by design. I don't remember why we hadn't called in parallel to begin with...
@int3 it's about 4,000,000 symbols IIRC, if that helps (this is a debug build of the main Uber app). Were there other numbers you wanted?
FYI before we merge, I need run clang-format on this to make sure there are no issues
lld/MachO/Writer.cpp | ||
---|---|---|
1083–1092 | Does this not introduce indeterminism in the final binaries, and hence mess up caching? |
Size of output binary + number of cores would also be good to know.
FYI before we merge, I need run clang-format on this to make sure there are no issues
Sounds good, will wait for that to land this
lld/MachO/Writer.cpp | ||
---|---|---|
1083–1092 | there's no sorting going on here though |
lld/MachO/Writer.cpp | ||
---|---|---|
1086 | using a range insert is more efficient |
lld/MachO/Writer.cpp | ||
---|---|---|
1083–1092 | NVM - resolved :) |
@int3 the binary is 1.7GB, __TEXT segment 508MB. I'm on an M1 macbook pro with 10 cores. I've run clang-format and no changes were needed. I'll be running it on future revisions from now on.
(sorry for the churn :/)
lld/MachO/Writer.cpp | ||
---|---|---|
1090 | nit: I think we usually use the append_range helper: https://cs.github.com/llvm/llvm-project?q=append_range |
lld/MachO/Writer.cpp | ||
---|---|---|
1090 | Actually, let me just fix this up while landing this. |
I patched this in and ran ninja check-lld and this breaks tests:
FAIL: lld :: MachO/invalid/range-check.s (804 of 2631) ******************** TEST 'lld :: MachO/invalid/range-check.s' FAILED ******************** Script: -- : 'RUN: at line 3'; rm -rf /Users/thakis/src/llvm-project/out/gn/obj/lld/test/MachO/invalid/Output/range-check.s.tmp; split-file /Users/thakis/src/llvm-project/lld/test/MachO/invalid/range-check.s /Users/thakis/src/llvm-project/out/gn/obj/lld/test/MachO/invalid/Output/range-check.s.tmp : 'RUN: at line 4'; /Users/thakis/src/llvm-project/out/gn/bin/llvm-mc -filetype=obj -triple=x86_64-apple-darwin /Users/thakis/src/llvm-project/out/gn/obj/lld/test/MachO/invalid/Output/range-check.s.tmp/test.s -o /Users/thakis/src/llvm-project/out/gn/obj/lld/test/MachO/invalid/Output/range-check.s.tmp/test.o : 'RUN: at line 5'; /Users/thakis/src/llvm-project/out/gn/bin/llvm-mc -filetype=obj -triple=x86_64-apple-darwin /Users/thakis/src/llvm-project/out/gn/obj/lld/test/MachO/invalid/Output/range-check.s.tmp/bar.s -o /Users/thakis/src/llvm-project/out/gn/obj/lld/test/MachO/invalid/Output/range-check.s.tmp/bar.o : 'RUN: at line 6'; ld64.lld -arch x86_64 -platform_version macos 11.0 11.0 -syslibroot /Users/thakis/src/llvm-project/lld/test/MachO/Inputs/MacOSX.sdk -fatal_warnings -dylib /Users/thakis/src/llvm-project/out/gn/obj/lld/test/MachO/invalid/Output/range-check.s.tmp/bar.o -o /Users/thakis/src/llvm-project/out/gn/obj/lld/test/MachO/invalid/Output/range-check.s.tmp/libbar.dylib : 'RUN: at line 7'; not ld64.lld -arch x86_64 -platform_version macos 11.0 11.0 -syslibroot /Users/thakis/src/llvm-project/lld/test/MachO/Inputs/MacOSX.sdk -fatal_warnings -lSystem -o /dev/null /Users/thakis/src/llvm-project/out/gn/obj/lld/test/MachO/invalid/Output/range-check.s.tmp/libbar.dylib /Users/thakis/src/llvm-project/out/gn/obj/lld/test/MachO/invalid/Output/range-check.s.tmp/test.o 2>&1 | /Users/thakis/src/llvm-project/out/gn/bin/FileCheck /Users/thakis/src/llvm-project/lld/test/MachO/invalid/range-check.s -- Exit Code: 1 Command Output (stderr): -- /Users/thakis/src/llvm-project/lld/test/MachO/invalid/range-check.s:11:10: error: CHECK: expected string not found in input # CHECK: error: stub is out of range: [[#]] is not in [-2147483648, 2147483647]; references _bar ^ <stdin>:5:235: note: scanning from here ld64.lld: error: /Users/thakis/src/llvm-project/out/gn/obj/lld/test/MachO/invalid/Output/range-check.s.tmp/test.o:(symbol _main+0x3): relocation GOT_LOAD is out of range: 4294974025 is not in [-2147483648, 2147483647]; references _foo ^ Input file: <stdin> Check file: /Users/thakis/src/llvm-project/lld/test/MachO/invalid/range-check.s -dump-input=help explains the following input dump. Input was: <<<<<< 1: ld64.lld: error: stub helper header is out of range: 4294973997 is not in [-2147483648, 2147483647] 2: ld64.lld: error: stub helper header is out of range: 4294969885 is not in [-2147483648, 2147483647] 3: ld64.lld: error: stub is out of range: 4294973998 is not in [-2147483648, 2147483647]; references _bar 4: ld64.lld: error: /Users/thakis/src/llvm-project/out/gn/obj/lld/test/MachO/invalid/Output/range-check.s.tmp/test.o:(symbol _main+0xd): relocation UNSIGNED is out of range: 8589942792 is not in [0, 4294967295]; references _foo 5: ld64.lld: error: /Users/thakis/src/llvm-project/out/gn/obj/lld/test/MachO/invalid/Output/range-check.s.tmp/test.o:(symbol _main+0x3): relocation GOT_LOAD is out of range: 4294974025 is not in [-2147483648, 2147483647]; references _foo check:11 X error: no match found >>>>>> -- ******************** ******************** Failed Tests (1): lld :: MachO/invalid/range-check.s
(This is on an M1 mac in case it matters. The test fails both with the patch as-is as well as with append_range. My local diff for the latter, in case you want to apply it on your end, is:
% git diff diff --git a/lld/MachO/Writer.cpp b/lld/MachO/Writer.cpp index 14d870144390..a485ba97d8d3 100644 --- a/lld/MachO/Writer.cpp +++ b/lld/MachO/Writer.cpp @@ -1083,10 +1083,8 @@ void Writer::openFile() { void Writer::writeSections() { uint8_t *buf = buffer->getBufferStart(); std::vector<const OutputSection *> osecs; - for (const OutputSegment *seg : outputSegments) { - const std::vector<OutputSection *> §ions = seg->getSections(); - osecs.insert(osecs.end(), sections.begin(), sections.end()); - } + for (const OutputSegment *seg : outputSegments) + append_range(osecs, seg->getSections()); parallelForEach(osecs.begin(), osecs.end(), [&](const OutputSection *osec) { osec->writeTo(buf + osec->fileOff);
)
(Marking as "Request Changes" so it disappears from my "Must Review" list. Please ping once this is addressed -- probably have to put diags emitted by osec::writeTo() into a (mutexed) vector and then print the contents of that vector sorted after, instead of outputting diags as they happen, to make sure diag output stays deterministic.)
@thakis so it seems like we have some options:
- collect the diagnostics emitted in all of those writeTo() functions into a mutexed vector (does this mean making new versions of helper functions that they call, like checkInt? it seems like a lot of work potentially)
- have an override in the error handler to buffer output temporarily
- in the test code, sort the actual output, and compare it to the expected output sorted
what would you prefer?
sketch of #2:
diff --git a/lld/Common/ErrorHandler.cpp b/lld/Common/ErrorHandler.cpp index 4cacd82c9f35..0f3fe8ce843d 100644 --- a/lld/Common/ErrorHandler.cpp +++ b/lld/Common/ErrorHandler.cpp @@ -47,6 +47,7 @@ void ErrorHandler::initialize(llvm::raw_ostream &stdoutOS, void ErrorHandler::flushStreams() { std::lock_guard<std::mutex> lock(mu); + // flush buffered streams here outs().flush(); errs().flush(); } @@ -75,9 +76,21 @@ raw_ostream &lld::errs() { return e.errs(); } + +void ErrorHandler::beginBufferingOutput() { + bufferOutput = true; +} + +void ErrorHandler::finishBufferingOutput() { + // have each buffer write to its corresponding output stream + bufferOutput = false; +} + raw_ostream &ErrorHandler::outs() { if (disableOutput) return llvm::nulls(); + if (bufferOutput) + return bufferedOutWrites; return stdoutOS ? *stdoutOS : llvm::outs(); } diff --git a/lld/include/lld/Common/ErrorHandler.h b/lld/include/lld/Common/ErrorHandler.h index 0ba4787e5888..56f9edc4c7be 100644 --- a/lld/include/lld/Common/ErrorHandler.h +++ b/lld/include/lld/Common/ErrorHandler.h @@ -115,6 +115,8 @@ public: raw_ostream &outs(); raw_ostream &errs(); + void beginBufferingOutput(); + void finishBufferingOutput(); void flushStreams(); std::unique_ptr<llvm::FileOutputBuffer> outputBuffer; @@ -138,6 +140,27 @@ private: std::mutex mu; llvm::raw_ostream *stdoutOS{}; llvm::raw_ostream *stderrOS{}; + + bool bufferOutput = false; + class MyStream : llvm::raw_ostream { + std::vector<std::string> writes; + void write_impl(const char *Ptr, size_t Size) override { + writes.push_back(std::string(Ptr, Size)); + } + uint64_t current_pos() const override { + return 0; + } + std::string sortedWrites() { + std::sort(writes.begin(), writes.end()); + std::string sortedWrites; + for (const std::string &write : writes) { + sortedWrites.append(write); + } + return sortedWrites; + } + }; + MyStream bufferedOutWrites; + MyStream bufferedErrWrites; }; /// Returns the default error handler.
@thakis does diag output *have* to be deterministic? The actual output binary should be deterministic, of course, but I don't see why the diag errors need to be
I'm thinking that we could just update the test to use CHECK-DAG...
I think I prefer 1.
All observable behavior has to be deterministic IMHO. Else you can't cache the output.
@thakis how would #1 look exactly? For example, in StubsSection::writeTo, it appears to end up in inline void writeStub(uint8_t *buf8, const uint32_t stubCode[3], const macho::Symbol &sym), which calls encodePage21, which calls checkInt. Would we make a separate version of encodePage21 and duplicate the logic, where this new version takes a mutex and a vector along with the usual arguments, and instead of calling normal checkInt, it calls our own version that would insert into the vector? Do we duplicate the logic of reportRangeError? And would our vector also store the type of message reported, whether it was a warning, error, etc.? What about for fatals, if we happen to find any of those, should we in that case terminate early, and if so, how would that work? With these questions in mind (assuming you don't have some other idea in mind of how to do this), I'm curious what your objections to #2 would be. Also, it seems like #2 would be helpful in any future cases where we want deterministic logging from concurrent tasks, such as if file loading is parallelized.
All observable behavior has to be deterministic IMHO. Else you can't cache the output.
Does your build cache the diag outputs as well as the binary outputs?
I haven't looked in detail, but what I would've done is:
- Have some DetermisticParallelDiags class that has warning(), error() methods, and internally a vector to store each
- And a flush() method to make it sort and emit the things it stores
- Have a global object that optionally points to such an object
- Make it point to such an object while we write output sections
This is similar to your 2, but built on top of the existing error messages instead of inside it (which kinda feels more modular to me).
This is just my opinion ofc, if y'all thing this is silly, feel free to ignore!
Does your build cache the diag outputs as well as the binary outputs?
I think it doesn't matter what our build cache currenty does. Running the same command twice with the same inputs and getting different outputs and getting different output seems Generally Bad, does it not?
(OTOH, if you build with ninja and the build fails, it also doesn't guarantee that errors appear in the same order every time. But for lld, it seems fairly easy to fix, and fixing it if it's easy seems like we should do just do it (?))
I think I changed my mind here. If Jez think it's fine, I suppose it's fine to have non-successful links have non-deterministic diags. So maybe just -DAG'ing the test is fine.
And if we decide it's not fine, then we should make the diag output deterministic again in a follow-up and not as part of this change. So relanding with tweaked test LGTM, actually.
Worth noting this is not the only place where we have non-deterministic ordering of diag messages. There's one here too (which you'd requested changing but I hadn't got to yet :) )
This one's kind of worse as it's a warning imho -- so here we have nondeterministic output on a _successful_ link.
@thakis any action items for me? It sounds like you're going to do the append_range change. Also, we can remove the #include I added, since it looks like it's already present further down. And lastly, I tested with the following patch as you specified and tests pass:
diff --git a/lld/test/MachO/invalid/range-check.s b/lld/test/MachO/invalid/range-check.s index 2f087e38c266..1ad719cfa31d 100644 --- a/lld/test/MachO/invalid/range-check.s +++ b/lld/test/MachO/invalid/range-check.s @@ -6,11 +6,11 @@ # RUN: %lld -dylib %t/bar.o -o %t/libbar.dylib # RUN: not %lld -lSystem -o /dev/null %t/libbar.dylib %t/test.o 2>&1 | FileCheck %s -# CHECK: error: {{.*}}test.o:(symbol _main+0xd): relocation UNSIGNED is out of range: [[#]] is not in [0, 4294967295]; references _foo -# CHECK: error: {{.*}}test.o:(symbol _main+0x3): relocation GOT_LOAD is out of range: [[#]] is not in [-2147483648, 2147483647]; references _foo -# CHECK: error: stub is out of range: [[#]] is not in [-2147483648, 2147483647]; references _bar -# CHECK: error: stub helper header is out of range: [[#]] is not in [-2147483648, 2147483647] -# CHECK: error: stub helper header is out of range: [[#]] is not in [-2147483648, 2147483647] +# CHECK-DAG: error: {{.*}}test.o:(symbol _main+0xd): relocation UNSIGNED is out of range: [[#]] is not in [0, 4294967295]; references _foo +# CHECK-DAG: error: {{.*}}test.o:(symbol _main+0x3): relocation GOT_LOAD is out of range: [[#]] is not in [-2147483648, 2147483647]; references _foo +# CHECK-DAG: error: stub is out of range: [[#]] is not in [-2147483648, 2147483647]; references _bar +# CHECK-DAG: error: stub helper header is out of range: [[#]] is not in [-2147483648, 2147483647] +# CHECK-DAG: error: stub helper header is out of range: [[#]] is not in [-2147483648, 2147483647]
using a range insert is more efficient