This is an archive of the discontinued LLVM Phabricator instance.

Write output sections in parallel
ClosedPublic

Authored by michaeleisel on Jun 1 2022, 11:52 AM.

Details

Reviewers
int3
thakis
oontvoo
Group Reviewers
Restricted Project
Commits
rG44978a234b8e: [lld/mac] Write output sections in parallel
Summary

This reduces linking time by ~8% for my project (1.19s -> 0.53s for writeSections()). writeTo is const, which bodes well for it being parallelizable, and I've looked through the different overridden versions and can't see any race conditions. It produces the same byte-for-byte output for my project. I can do another look through the functions if people are in favor of this patch.

Diff Detail

Event Timeline

michaeleisel created this revision.Jun 1 2022, 11:52 AM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptJun 1 2022, 11:52 AM
Herald added a reviewer: Restricted Project. · View Herald Transcript
michaeleisel requested review of this revision.Jun 1 2022, 11:52 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 1 2022, 11:52 AM
michaeleisel edited the summary of this revision. (Show Details)Jun 1 2022, 11:55 AM
int3 added a subscriber: int3.Jun 1 2022, 11:56 AM

Nice. Do you have some perf numbers + the size of the workload you're benchmarking against?

int3 added a comment.Jun 1 2022, 11:56 AM

I think I commented before the edit :)

int3 accepted this revision.Jun 1 2022, 11:58 AM

Would still love to have numbers on the workload size if possible.

I've looked through the different overridden versions and can't see any race conditions.

Yes, writeTo should be synchronization-free by design. I don't remember why we hadn't called in parallel to begin with...

This revision is now accepted and ready to land.Jun 1 2022, 11:58 AM

@int3 it's about 4,000,000 symbols IIRC, if that helps (this is a debug build of the main Uber app). Were there other numbers you wanted?

FYI before we merge, I need run clang-format on this to make sure there are no issues

oontvoo requested changes to this revision.Jun 1 2022, 12:09 PM
oontvoo added a subscriber: oontvoo.
oontvoo added inline comments.
lld/MachO/Writer.cpp
1083–1092

Does this not introduce indeterminism in the final binaries, and hence mess up caching?
(specifically the ordering of the output sections may change even if the input does not..)

This revision now requires changes to proceed.Jun 1 2022, 12:09 PM
int3 added a comment.Jun 1 2022, 12:11 PM

Size of output binary + number of cores would also be good to know.

FYI before we merge, I need run clang-format on this to make sure there are no issues

Sounds good, will wait for that to land this

lld/MachO/Writer.cpp
1083–1092

there's no sorting going on here though

MaskRay added inline comments.
lld/MachO/Writer.cpp
1086

using a range insert is more efficient

oontvoo added inline comments.Jun 1 2022, 12:30 PM
lld/MachO/Writer.cpp
1083–1092

NVM - resolved :)

@int3 the binary is 1.7GB, __TEXT segment 508MB. I'm on an M1 macbook pro with 10 cores. I've run clang-format and no changes were needed. I'll be running it on future revisions from now on.

Use range-based insertion for output sections

thakis added a subscriber: thakis.Jun 1 2022, 4:11 PM

(sorry for the churn :/)

lld/MachO/Writer.cpp
1090

nit: I think we usually use the append_range helper: https://cs.github.com/llvm/llvm-project?q=append_range

thakis added inline comments.Jun 1 2022, 4:12 PM
lld/MachO/Writer.cpp
1090

Actually, let me just fix this up while landing this.

thakis added a comment.Jun 1 2022, 4:15 PM

I patched this in and ran ninja check-lld and this breaks tests:

FAIL: lld :: MachO/invalid/range-check.s (804 of 2631)
******************** TEST 'lld :: MachO/invalid/range-check.s' FAILED ********************
Script:
--
: 'RUN: at line 3';   rm -rf /Users/thakis/src/llvm-project/out/gn/obj/lld/test/MachO/invalid/Output/range-check.s.tmp; split-file /Users/thakis/src/llvm-project/lld/test/MachO/invalid/range-check.s /Users/thakis/src/llvm-project/out/gn/obj/lld/test/MachO/invalid/Output/range-check.s.tmp
: 'RUN: at line 4';   /Users/thakis/src/llvm-project/out/gn/bin/llvm-mc -filetype=obj -triple=x86_64-apple-darwin /Users/thakis/src/llvm-project/out/gn/obj/lld/test/MachO/invalid/Output/range-check.s.tmp/test.s -o /Users/thakis/src/llvm-project/out/gn/obj/lld/test/MachO/invalid/Output/range-check.s.tmp/test.o
: 'RUN: at line 5';   /Users/thakis/src/llvm-project/out/gn/bin/llvm-mc -filetype=obj -triple=x86_64-apple-darwin /Users/thakis/src/llvm-project/out/gn/obj/lld/test/MachO/invalid/Output/range-check.s.tmp/bar.s -o /Users/thakis/src/llvm-project/out/gn/obj/lld/test/MachO/invalid/Output/range-check.s.tmp/bar.o
: 'RUN: at line 6';   ld64.lld -arch x86_64 -platform_version macos 11.0 11.0 -syslibroot /Users/thakis/src/llvm-project/lld/test/MachO/Inputs/MacOSX.sdk -fatal_warnings -dylib /Users/thakis/src/llvm-project/out/gn/obj/lld/test/MachO/invalid/Output/range-check.s.tmp/bar.o -o /Users/thakis/src/llvm-project/out/gn/obj/lld/test/MachO/invalid/Output/range-check.s.tmp/libbar.dylib
: 'RUN: at line 7';   not ld64.lld -arch x86_64 -platform_version macos 11.0 11.0 -syslibroot /Users/thakis/src/llvm-project/lld/test/MachO/Inputs/MacOSX.sdk -fatal_warnings -lSystem -o /dev/null /Users/thakis/src/llvm-project/out/gn/obj/lld/test/MachO/invalid/Output/range-check.s.tmp/libbar.dylib /Users/thakis/src/llvm-project/out/gn/obj/lld/test/MachO/invalid/Output/range-check.s.tmp/test.o 2>&1 | /Users/thakis/src/llvm-project/out/gn/bin/FileCheck /Users/thakis/src/llvm-project/lld/test/MachO/invalid/range-check.s
--
Exit Code: 1

Command Output (stderr):
--
/Users/thakis/src/llvm-project/lld/test/MachO/invalid/range-check.s:11:10: error: CHECK: expected string not found in input
# CHECK: error: stub is out of range: [[#]] is not in [-2147483648, 2147483647]; references _bar
         ^
<stdin>:5:235: note: scanning from here
ld64.lld: error: /Users/thakis/src/llvm-project/out/gn/obj/lld/test/MachO/invalid/Output/range-check.s.tmp/test.o:(symbol _main+0x3): relocation GOT_LOAD is out of range: 4294974025 is not in [-2147483648, 2147483647]; references _foo
                                                                                                                                                                                                                                          ^

Input file: <stdin>
Check file: /Users/thakis/src/llvm-project/lld/test/MachO/invalid/range-check.s

-dump-input=help explains the following input dump.

Input was:
<<<<<<
          1: ld64.lld: error: stub helper header is out of range: 4294973997 is not in [-2147483648, 2147483647] 
          2: ld64.lld: error: stub helper header is out of range: 4294969885 is not in [-2147483648, 2147483647] 
          3: ld64.lld: error: stub is out of range: 4294973998 is not in [-2147483648, 2147483647]; references _bar 
          4: ld64.lld: error: /Users/thakis/src/llvm-project/out/gn/obj/lld/test/MachO/invalid/Output/range-check.s.tmp/test.o:(symbol _main+0xd): relocation UNSIGNED is out of range: 8589942792 is not in [0, 4294967295]; references _foo 
          5: ld64.lld: error: /Users/thakis/src/llvm-project/out/gn/obj/lld/test/MachO/invalid/Output/range-check.s.tmp/test.o:(symbol _main+0x3): relocation GOT_LOAD is out of range: 4294974025 is not in [-2147483648, 2147483647]; references _foo 
check:11                                                                                                                                                                                                                                               X error: no match found
>>>>>>

--

********************
********************
Failed Tests (1):
  lld :: MachO/invalid/range-check.s
thakis added a comment.Jun 1 2022, 4:17 PM

(This is on an M1 mac in case it matters. The test fails both with the patch as-is as well as with append_range. My local diff for the latter, in case you want to apply it on your end, is:

% git diff
diff --git a/lld/MachO/Writer.cpp b/lld/MachO/Writer.cpp
index 14d870144390..a485ba97d8d3 100644
--- a/lld/MachO/Writer.cpp
+++ b/lld/MachO/Writer.cpp
@@ -1083,10 +1083,8 @@ void Writer::openFile() {
 void Writer::writeSections() {
   uint8_t *buf = buffer->getBufferStart();
   std::vector<const OutputSection *> osecs;
-  for (const OutputSegment *seg : outputSegments) {
-    const std::vector<OutputSection *> &sections = seg->getSections();
-    osecs.insert(osecs.end(), sections.begin(), sections.end());
-  }
+  for (const OutputSegment *seg : outputSegments)
+    append_range(osecs, seg->getSections());
 
   parallelForEach(osecs.begin(), osecs.end(), [&](const OutputSection *osec) {
     osec->writeTo(buf + osec->fileOff);

)

thakis requested changes to this revision.Jun 3 2022, 8:51 AM

(Marking as "Request Changes" so it disappears from my "Must Review" list. Please ping once this is addressed -- probably have to put diags emitted by osec::writeTo() into a (mutexed) vector and then print the contents of that vector sorted after, instead of outputting diags as they happen, to make sure diag output stays deterministic.)

This revision now requires changes to proceed.Jun 3 2022, 8:51 AM

@thakis so it seems like we have some options:

  • collect the diagnostics emitted in all of those writeTo() functions into a mutexed vector (does this mean making new versions of helper functions that they call, like checkInt? it seems like a lot of work potentially)
  • have an override in the error handler to buffer output temporarily
  • in the test code, sort the actual output, and compare it to the expected output sorted

what would you prefer?

sketch of #2:

diff --git a/lld/Common/ErrorHandler.cpp b/lld/Common/ErrorHandler.cpp
index 4cacd82c9f35..0f3fe8ce843d 100644
--- a/lld/Common/ErrorHandler.cpp
+++ b/lld/Common/ErrorHandler.cpp
@@ -47,6 +47,7 @@ void ErrorHandler::initialize(llvm::raw_ostream &stdoutOS,
 
 void ErrorHandler::flushStreams() {
   std::lock_guard<std::mutex> lock(mu);
+  // flush buffered streams here
   outs().flush();
   errs().flush();
 }
@@ -75,9 +76,21 @@ raw_ostream &lld::errs() {
   return e.errs();
 }
 
+
+void ErrorHandler::beginBufferingOutput() {
+  bufferOutput = true;
+}
+
+void ErrorHandler::finishBufferingOutput() {
+  // have each buffer write to its corresponding output stream
+  bufferOutput = false;
+}
+
 raw_ostream &ErrorHandler::outs() {
   if (disableOutput)
     return llvm::nulls();
+  if (bufferOutput)
+    return bufferedOutWrites;
   return stdoutOS ? *stdoutOS : llvm::outs();
 }
 
diff --git a/lld/include/lld/Common/ErrorHandler.h b/lld/include/lld/Common/ErrorHandler.h
index 0ba4787e5888..56f9edc4c7be 100644
--- a/lld/include/lld/Common/ErrorHandler.h
+++ b/lld/include/lld/Common/ErrorHandler.h
@@ -115,6 +115,8 @@ public:
 
   raw_ostream &outs();
   raw_ostream &errs();
+  void beginBufferingOutput();
+  void finishBufferingOutput();
   void flushStreams();
 
   std::unique_ptr<llvm::FileOutputBuffer> outputBuffer;
@@ -138,6 +140,27 @@ private:
   std::mutex mu;
   llvm::raw_ostream *stdoutOS{};
   llvm::raw_ostream *stderrOS{};
+
+  bool bufferOutput = false;
+  class MyStream : llvm::raw_ostream {
+    std::vector<std::string> writes;
+    void write_impl(const char *Ptr, size_t Size) override {
+      writes.push_back(std::string(Ptr, Size));
+    }
+    uint64_t current_pos() const override {
+      return 0;
+    }
+    std::string sortedWrites() {
+      std::sort(writes.begin(), writes.end());
+      std::string sortedWrites;
+      for (const std::string &write : writes) {
+        sortedWrites.append(write);
+      }
+      return sortedWrites;
+    }
+  };
+  MyStream bufferedOutWrites;
+  MyStream bufferedErrWrites;
 };
 
 /// Returns the default error handler.
int3 added a comment.Jun 6 2022, 12:59 PM

@thakis does diag output *have* to be deterministic? The actual output binary should be deterministic, of course, but I don't see why the diag errors need to be

I'm thinking that we could just update the test to use CHECK-DAG...

thakis added a comment.Jun 7 2022, 6:38 PM

I think I prefer 1.

All observable behavior has to be deterministic IMHO. Else you can't cache the output.

@thakis how would #1 look exactly? For example, in StubsSection::writeTo, it appears to end up in inline void writeStub(uint8_t *buf8, const uint32_t stubCode[3], const macho::Symbol &sym), which calls encodePage21, which calls checkInt. Would we make a separate version of encodePage21 and duplicate the logic, where this new version takes a mutex and a vector along with the usual arguments, and instead of calling normal checkInt, it calls our own version that would insert into the vector? Do we duplicate the logic of reportRangeError? And would our vector also store the type of message reported, whether it was a warning, error, etc.? What about for fatals, if we happen to find any of those, should we in that case terminate early, and if so, how would that work? With these questions in mind (assuming you don't have some other idea in mind of how to do this), I'm curious what your objections to #2 would be. Also, it seems like #2 would be helpful in any future cases where we want deterministic logging from concurrent tasks, such as if file loading is parallelized.

int3 added a comment.Jun 8 2022, 9:57 AM

All observable behavior has to be deterministic IMHO. Else you can't cache the output.

Does your build cache the diag outputs as well as the binary outputs?

@thakis how would #1 look exactly? For example, in StubsSection::writeTo, it appears to end up in inline void writeStub(uint8_t *buf8, const uint32_t stubCode[3], const macho::Symbol &sym), which calls encodePage21, which calls checkInt. Would we make a separate version of encodePage21 and duplicate the logic, where this new version takes a mutex and a vector along with the usual arguments, and instead of calling normal checkInt, it calls our own version that would insert into the vector? Do we duplicate the logic of reportRangeError? And would our vector also store the type of message reported, whether it was a warning, error, etc.? What about for fatals, if we happen to find any of those, should we in that case terminate early, and if so, how would that work? With these questions in mind (assuming you don't have some other idea in mind of how to do this), I'm curious what your objections to #2 would be. Also, it seems like #2 would be helpful in any future cases where we want deterministic logging from concurrent tasks, such as if file loading is parallelized.

I haven't looked in detail, but what I would've done is:

  • Have some DetermisticParallelDiags class that has warning(), error() methods, and internally a vector to store each
  • And a flush() method to make it sort and emit the things it stores
  • Have a global object that optionally points to such an object
  • Make it point to such an object while we write output sections

This is similar to your 2, but built on top of the existing error messages instead of inside it (which kinda feels more modular to me).

This is just my opinion ofc, if y'all thing this is silly, feel free to ignore!

Does your build cache the diag outputs as well as the binary outputs?

I think it doesn't matter what our build cache currenty does. Running the same command twice with the same inputs and getting different outputs and getting different output seems Generally Bad, does it not?

(OTOH, if you build with ninja and the build fails, it also doesn't guarantee that errors appear in the same order every time. But for lld, it seems fairly easy to fix, and fixing it if it's easy seems like we should do just do it (?))

thakis accepted this revision.Jun 8 2022, 11:09 AM

I think I changed my mind here. If Jez think it's fine, I suppose it's fine to have non-successful links have non-deterministic diags. So maybe just -DAG'ing the test is fine.

And if we decide it's not fine, then we should make the diag output deterministic again in a follow-up and not as part of this change. So relanding with tweaked test LGTM, actually.

This revision now requires review to proceed.Jun 8 2022, 11:09 AM

I think I changed my mind here. If Jez think it's fine, I suppose it's fine to have non-successful links have non-deterministic diags. So maybe just -DAG'ing the test is fine.

And if we decide it's not fine, then we should make the diag output deterministic again in a follow-up and not as part of this change. So relanding with tweaked test LGTM, actually.

Worth noting this is not the only place where we have non-deterministic ordering of diag messages. There's one here too (which you'd requested changing but I hadn't got to yet :) )

oontvoo resigned from this revision.Jun 8 2022, 11:15 AM

(rescinded my previous request-changes)

This revision is now accepted and ready to land.Jun 8 2022, 11:15 AM

I think I changed my mind here. If Jez think it's fine, I suppose it's fine to have non-successful links have non-deterministic diags. So maybe just -DAG'ing the test is fine.

And if we decide it's not fine, then we should make the diag output deterministic again in a follow-up and not as part of this change. So relanding with tweaked test LGTM, actually.

Worth noting this is not the only place where we have non-deterministic ordering of diag messages. There's one here too (which you'd requested changing but I hadn't got to yet :) )

This one's kind of worse as it's a warning imho -- so here we have nondeterministic output on a _successful_ link.

@thakis any action items for me? It sounds like you're going to do the append_range change. Also, we can remove the #include I added, since it looks like it's already present further down. And lastly, I tested with the following patch as you specified and tests pass:

diff --git a/lld/test/MachO/invalid/range-check.s b/lld/test/MachO/invalid/range-check.s
index 2f087e38c266..1ad719cfa31d 100644
--- a/lld/test/MachO/invalid/range-check.s
+++ b/lld/test/MachO/invalid/range-check.s
@@ -6,11 +6,11 @@
 # RUN: %lld -dylib %t/bar.o -o %t/libbar.dylib
 # RUN: not %lld -lSystem -o /dev/null %t/libbar.dylib %t/test.o 2>&1 | FileCheck %s
 
-# CHECK: error: {{.*}}test.o:(symbol _main+0xd): relocation UNSIGNED is out of range: [[#]] is not in [0, 4294967295]; references _foo
-# CHECK: error: {{.*}}test.o:(symbol _main+0x3): relocation GOT_LOAD is out of range: [[#]] is not in [-2147483648, 2147483647]; references _foo
-# CHECK: error: stub is out of range: [[#]] is not in [-2147483648, 2147483647]; references _bar
-# CHECK: error: stub helper header is out of range: [[#]] is not in [-2147483648, 2147483647]
-# CHECK: error: stub helper header is out of range: [[#]] is not in [-2147483648, 2147483647]
+# CHECK-DAG: error: {{.*}}test.o:(symbol _main+0xd): relocation UNSIGNED is out of range: [[#]] is not in [0, 4294967295]; references _foo
+# CHECK-DAG: error: {{.*}}test.o:(symbol _main+0x3): relocation GOT_LOAD is out of range: [[#]] is not in [-2147483648, 2147483647]; references _foo
+# CHECK-DAG: error: stub is out of range: [[#]] is not in [-2147483648, 2147483647]; references _bar
+# CHECK-DAG: error: stub helper header is out of range: [[#]] is not in [-2147483648, 2147483647]
+# CHECK-DAG: error: stub helper header is out of range: [[#]] is not in [-2147483648, 2147483647]
thakis added a comment.Jun 8 2022, 1:20 PM

No, I'll land this tonight, thanks!

This revision was landed with ongoing or failed builds.Jun 8 2022, 5:12 PM
This revision was automatically updated to reflect the committed changes.