On top of https://reviews.llvm.org/D113063 this took another 10 seconds
off our overall link time.
Reviewers: gkm, int3
Group Reviewers: Restricted Project
Commit: rGf79e65e61faf: [lld-macho] Cache library paths from findLibrary
Repository: rG LLVM Github Monorepo
Event Timeline
lld/MachO/Driver.cpp:85–106: what value does the lambda add here?
lld/MachO/Driver.cpp:85–106: IMO the reduced duplication is a big benefit.
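(For readers following along, the pattern under discussion is roughly the following - a hypothetical, simplified sketch rather than the exact code in lld/MachO/Driver.cpp; `searchPaths` and `searchDylibsFirst` are stand-ins. The point of the inner lambda is that the search has several early returns, so wrapping it gives the caching layer a single place to record the result.)

```cpp
#include <initializer_list>
#include <map>
#include <optional>
#include <string>

// Hypothetical sketch of the caching pattern being reviewed (identifiers are
// illustrative, not the exact ones from the diff).
std::optional<std::string> findLibrary(const std::string &name,
                                       bool searchDylibsFirst) {
  static std::map<std::string, std::optional<std::string>> resolvedLibraries;
  auto cached = resolvedLibraries.find(name);
  if (cached != resolvedLibraries.end())
    return cached->second; // cache hit: skip the filesystem search entirely

  // Stand-in for the real search over -L directories and file extensions.
  auto searchPaths = [&](std::initializer_list<const char *> exts)
      -> std::optional<std::string> {
    (void)exts;
    return std::nullopt;
  };

  // The search has multiple early returns; wrapping it in a lambda gives the
  // caching code below a single point at which to store the result.
  auto doSearch = [&]() -> std::optional<std::string> {
    if (searchDylibsFirst) {
      if (auto p = searchPaths({".tbd", ".dylib"}))
        return p;
      return searchPaths({".a"});
    }
    return searchPaths({".tbd", ".dylib", ".a"});
  };

  std::optional<std::string> path = doSearch();
  resolvedLibraries[name] = path; // negative results are cached too
  return path;
}
```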
By the way, thanks for contributing all these optimizations! I was quite surprised to hear that ld64 was faster, given that LLD is typically much faster for our own workloads, but I guess you have rather different inputs. Hopefully we can make LLD the fastest Mach-O linker for all builds :)
Thanks for all the reviews! For context our project is a huge iOS application with on the order of thousands of static libraries, and many iOS system framework + system library dependencies. Is this a case you've benchmarked? If so I'd be interested to dig a bit deeper into the differences to try and understand why it has been slower for us.
Thanks!
lld/MachO/Driver.cpp:1097: ultra nit: can we make sure resolvedLibraries.clear() is next to resolvedFrameworks.clear()?
> For context our project is a huge iOS application with on the order of thousands of static libraries, and many iOS system framework + system library dependencies. Is this a case you've benchmarked?
I'm not sure how many static libraries our iOS applications have, but I'll get back to you when I've run the numbers. One guess is that you guys might be using a lot more Swift than we do, and swiftc makes extensive use of LC_LINKER_OPTION, which might explain why there are so many flags to parse / paths to search.
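(For anyone unfamiliar with LC_LINKER_OPTION: it is a Mach-O load command through which an object file embeds extra linker flags, and swiftc emits a lot of them. Its layout, as described in <mach-o/loader.h>, is roughly the following; comments are paraphrased.)

```cpp
#include <cstdint>

// Layout of LC_LINKER_OPTION as described in <mach-o/loader.h>. Each command
// carries `count` NUL-terminated UTF-8 strings, e.g. "-framework",
// "Foundation"; every such pair is another flag the linker has to parse and
// another library/framework path it may have to search, which is why many of
// these load commands can add up to a lot of work.
struct linker_option_command {
  uint32_t cmd;     // LC_LINKER_OPTION
  uint32_t cmdsize; // total size, including the strings and trailing padding
  uint32_t count;   // number of strings that follow the fixed-size header
};
```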
lld/MachO/Driver.cpp:1097: Sounds good, I'll do that in the other diff since that one isn't reviewed yet.
For one of our largest iOS apps that I've measured (different from int3's):
- ~7100 archives
- 56 frameworks (including system ones)
- 12 weak frameworks
Thanks for the info! What's your bottleneck at this point? After all my changes here, our biggest remaining cost is the 20+ seconds we now spend in lld::macho::readFile.
The bottleneck right now is that it crashes during linking... ;) I'll post an update ~next week.
If you can generate a reproducer tarball (--reproduce) and upload it somewhere, that might be interesting. I know LLD COFF does asynchronous file reads because of file I/O generally being slower on Windows; maybe that's something that'd help on macOS as well?
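(To make that concrete, the idea is roughly the following - a toy sketch using std::async rather than whatever mechanism LLD COFF actually uses; `readWholeFile` and `loadAll` are made-up names, not LLD APIs.)

```cpp
#include <fstream>
#include <future>
#include <iterator>
#include <string>
#include <vector>

// Toy sketch of overlapping file I/O with later work: kick off every read
// asynchronously, then consume the buffers as they complete. Illustrative
// only; not how LLD is structured today.
static std::string readWholeFile(const std::string &path) {
  std::ifstream in(path, std::ios::binary);
  return std::string(std::istreambuf_iterator<char>(in),
                     std::istreambuf_iterator<char>());
}

std::vector<std::string> loadAll(const std::vector<std::string> &paths) {
  std::vector<std::future<std::string>> pending;
  pending.reserve(paths.size());
  for (const std::string &p : paths)
    pending.push_back(std::async(std::launch::async, readWholeFile, p));

  std::vector<std::string> buffers;
  buffers.reserve(pending.size());
  for (auto &f : pending)
    buffers.push_back(f.get()); // parsing of each buffer could start here
  return buffers;
}
```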
It looks like the issue is that some of the system dylibs being loaded should be cached. It appears we are currently doing about 760k readFile calls, but there are only about 1500 unique files. I'm preparing a patch for discussion that mirrors ld64's behavior of caching these as they are read.
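(A minimal sketch of that kind of dedup cache, assuming a simple path-keyed map; `Buffer` and `readFromDisk` are placeholders rather than lld's actual MemoryBuffer machinery.)

```cpp
#include <fstream>
#include <iterator>
#include <map>
#include <memory>
#include <string>

// Placeholder for the buffer type lld would actually use.
struct Buffer {
  std::string data;
};

static std::shared_ptr<Buffer> readFromDisk(const std::string &path) {
  std::ifstream in(path, std::ios::binary);
  auto buf = std::make_shared<Buffer>();
  buf->data.assign(std::istreambuf_iterator<char>(in),
                   std::istreambuf_iterator<char>());
  return buf;
}

// Hypothetical dedup cache: only the first request for a given path touches
// the disk, so ~760k lookups would collapse to ~1500 actual reads.
std::shared_ptr<Buffer> readFileCached(const std::string &path) {
  static std::map<std::string, std::shared_ptr<Buffer>> cache;
  auto [it, inserted] = cache.try_emplace(path);
  if (inserted)
    it->second = readFromDisk(path);
  return it->second;
}
```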
*With* all of the patches that improve input loading applied, linking the app above is only slightly faster than with our optimised LD64:
(cross-linking on Linux; Broadwell (yeah, kind of old), 32 GB RAM)
x ./LLD_with_cache_patches.txt
+ ./LD64_local_imprv.txt
    N      Min      Max    Median      Avg        Stddev
x  10    23.64    23.89     23.82   23.799   0.077667382
+  10    23.69    25.01     24.04   24.146     0.4379041
Difference at 95.0% confidence
        0.347 +/- 0.295482
        1.45804% +/- 1.24157%
        (Student's t, pooled s = 0.314478)
Looking at the trace, the bottleneck is still in loading input and writing output.
It seems it could be improved further ...
Thanks for the data point! You might have mentioned this before, but does this mean these changes _were_ an improvement for you as well? Do you have timings for this benchmark from before these changes?
Yes - these changes improved it a bit.
Without these improvements, LLD vs LD64 (our fork):
x LLD_no_cache.txt
+ LD64_local_imprv.txt
    N      Min      Max    Median      Avg       Stddev
x  10    23.71    25.16     23.97   24.073   0.40136019
+  10    23.69    25.01     24.04   24.146    0.4379041
No difference proven at 95.0% confidence
That's too bad, I guess our use cases are just different enough. But good to know it didn't regress, thanks!
This is interesting. LLD has been at least 2x faster for us than ld64 (and also a bit faster than zld, which is a public ld64 fork with speed improvements), and I believe it was 3 to 4x faster for @thakis. You have some impressive ld64 improvements :)
It's interesting that you have the large gap between parsing input files and the next trace. I haven't seen that on our end yet.
I spent some time looking at the performance for loading inputs. Most of the time was going into parsing symbols and relocations, as expected.
Doing symbol resolution in parallel seems pretty tricky because of archive loading semantics and weak symbols. A while back, @ruiu had the idea to at least parallelize all the symbol table insertions (https://lists.llvm.org/pipermail/llvm-dev/2019-April/131902.html), but I don't think he prototyped it (or at least I can't find the results from the prototype anywhere). More recently, @int3 was looking into parallelizing bits of LLD (https://lists.llvm.org/pipermail/llvm-dev/2021-April/149651.html), and there was some interesting discussion there (particularly Jez's follow-up in https://lists.llvm.org/pipermail/llvm-dev/2021-April/149675.html, and @rnk's parallel symbol resolution idea in https://lists.llvm.org/pipermail/llvm-dev/2021-April/149689.html).
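(As a concrete illustration of that sharded-insertion idea - a hypothetical sketch of the concept, not code from any of those threads: each worker owns a disjoint set of buckets chosen by name hash, so no locking is needed while inserting.)

```cpp
#include <array>
#include <functional>
#include <string>
#include <thread>
#include <unordered_map>
#include <vector>

// Sketch of sharded parallel symbol-table insertion. Every worker scans the
// full symbol list but only inserts the names whose hash maps to its shard,
// so the shards never race with each other.
constexpr std::size_t kShards = 16;

struct ShardedSymtab {
  std::array<std::unordered_map<std::string, int /*placeholder symbol data*/>,
             kShards>
      shards;

  void insertAll(const std::vector<std::string> &names) {
    std::vector<std::thread> workers;
    workers.reserve(kShards);
    for (std::size_t s = 0; s < kShards; ++s) {
      workers.emplace_back([this, &names, s] {
        for (const std::string &n : names)
          if (std::hash<std::string>{}(n) % kShards == s)
            shards[s].emplace(n, 0); // this shard owns n; no lock needed
      });
    }
    for (std::thread &t : workers)
      t.join();
  }
};
```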
For relocation processing specifically, the expensive parts on our end were (a) fetching embedded addends from the instruction stream (which results in lots of cache misses), and (b) for relocations pointing to a section (i.e. where the r_extern bit isn't set), finding the subsection to associate them with. (a) seems kinda unavoidable, and for (b), I tried packing the offsets tightly into a packed array instead of having them spread out across a struct, which didn't help, and doing a linear search instead of a binary search, which was much worse. I also had the idea of parallelizing relocation parsing across all InputFiles (it's after symbol resolution, so it should be trivially parallelizable), but that also didn't make any difference, surprisingly (I only tried it with one link, and I haven't looked into why). (A WIP sketch of the parallel relocation processing which won't work for LTO is in D113542, for reference.)
(Update: I think I might have been measuring parallel relocation processing wrong; I'm seeing a 20% win from it now. I'll need to test it more though.)
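(For reference, the shape of that experiment is roughly the following - a simplified sketch of parallelizing per-file relocation parsing, not the actual D113542 patch; `ObjFileStub` and `parseRelocations` are stand-ins for lld's real types.)

```cpp
#include "llvm/Support/Parallel.h"
#include <vector>

// Stand-in for lld::macho::ObjFile; parseRelocations() represents the
// per-file relocation work.
struct ObjFileStub {
  void parseRelocations() { /* touches only this file's own state */ }
};

// Because this runs after symbol resolution, each file's relocations can be
// parsed independently, so a simple parallel loop over the files suffices,
// provided the per-file work never mutates shared state.
void parseAllRelocations(std::vector<ObjFileStub *> &files) {
  llvm::parallelForEach(files,
                        [](ObjFileStub *f) { f->parseRelocations(); });
}
```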
.... [cropped]
> This is interesting. LLD has been at least 2x faster for us than ld64 (and also a bit faster than zld, which is a public ld64 fork with speed improvements), and I believe it was 3 to 4x faster for @thakis. You have some impressive ld64 improvements :)
> It's interesting that you have the large gap between parsing input files and the next trace. I haven't seen that on our end yet.
This is likely because for this particular app (actually also quite true for most of our apps), the input set is huge - it seems LLD spends most of its time blocking on disk reads.
By contrast, ld64 parses all input files up front in parallel (same for archives - it loads all members that it knows will be needed, thanks to -ObjC/-all_load/-force_load).
(At least, some good news: as shown below, LLD's CPU time is less than LD64's ;) - so it's efficient, just not as parallel as it could be.)
SYSTEM CPU time:
x ./LD64_res2.txt
+ ./LLD_released_res2.txt
    N      Min      Max    Median      Avg        Stddev
x   5     1.84     2.01      1.90    1.916   0.065802736
+   5     3.28     3.36      3.29    3.312   0.035637059
Difference at 95.0% confidence
        1.396 +/- 0.0771735
        72.8601% +/- 4.02785%
        (Student's t, pooled s = 0.052915)

USER CPU time:
x ./LD64_res2.txt
+ ./LLD_released_res2.txt
    N      Min      Max    Median      Avg        Stddev
x   5    12.04    12.28     12.06   12.124    0.10644247
+   5     5.51     5.66      5.54    5.562   0.060991803
Difference at 95.0% confidence
        -6.562 +/- 0.126515
        -54.1241% +/- 1.04351%
        (Student's t, pooled s = 0.0867468)

WALL time:
x ./LD64_res2.txt
+ ./LLD_released_res2.txt
    N      Min      Max    Median      Avg        Stddev
x   5    12.42    12.59     12.54   12.516   0.064265076
+   5    15.01    15.26     15.21   15.184    0.10139033
Difference at 95.0% confidence
        2.668 +/- 0.123796
        21.3167% +/- 0.989101%
        (Student's t, pooled s = 0.0848823)
> I spent some time looking at the performance for loading inputs. Most of the time was going into parsing symbols and relocations, as expected.
Do you have stats on how much time is spent parsing symbols relative to relocation handling?
Ditto for cleanupCallback (as mentioned in D113063).