This is an archive of the discontinued LLVM Phabricator instance.

[lld-macho] Cache library paths from findLibrary
ClosedPublic

Authored by keith on Nov 2 2021, 5:23 PM.

Details

Reviewers
gkm
int3
Group Reviewers
Restricted Project
Commits
rGf79e65e61faf: [lld-macho] Cache library paths from findLibrary
Summary

On top of https://reviews.llvm.org/D113063, this took another 10 seconds off our overall link time.

Diff Detail

Event Timeline

keith created this revision. Nov 2 2021, 5:23 PM
Herald added a project: Restricted Project.
Herald added a reviewer: Restricted Project.
keith requested review of this revision. Nov 2 2021, 5:23 PM
oontvoo added a subscriber: oontvoo. Nov 2 2021, 6:36 PM
oontvoo added inline comments.
lld/MachO/Driver.cpp
85–106
101

should check if(path) first?

int3 added a subscriber: int3. Nov 2 2021, 7:15 PM
int3 added inline comments.
lld/MachO/Driver.cpp
85–106

what value does the lambda add here?

keith updated this revision to Diff 384305. Nov 2 2021, 7:15 PM
keith marked 2 inline comments as done.

Use lambda instead of duplication

lld/MachO/Driver.cpp
85–106

Ah great idea, thanks!

keith added inline comments. Nov 2 2021, 7:16 PM
lld/MachO/Driver.cpp
85–106

IMO the reduced duplication is a big benefit

int3 added inline comments. Nov 2 2021, 7:36 PM
lld/MachO/Driver.cpp
83

ditto about cleanupCallback (as mentioned in D113063)

85–106

oh right. I saw it only being invoked once and got confused, but I see that it's actually factoring out the assignment. I should read more carefully before commenting 😅
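
To make the pattern concrete for readers of this archive, here is a rough, standalone sketch of what the diff and the inline comments are discussing: a cache of resolved library paths in findLibrary, with the search itself factored into a lambda so the cache insertion is written only once. The real code lives in lld/MachO/Driver.cpp and uses LLVM types (StringRef, DenseMap) and lld's own path-search helpers; the std:: types, the searchDylibsFirst flag, and the extension lists below are simplified stand-ins rather than the committed implementation.

```
// Simplified, standalone sketch of the caching pattern discussed above.
#include <filesystem>
#include <map>
#include <optional>
#include <string>
#include <vector>

static std::map<std::string, std::string> resolvedLibraries;

std::optional<std::string>
findLibrary(const std::string &name,
            const std::vector<std::string> &searchPaths,
            bool searchDylibsFirst) {
  // Fast path: return a previously resolved path.
  auto cached = resolvedLibraries.find(name);
  if (cached != resolvedLibraries.end())
    return cached->second;

  // Probe searchPaths for "lib<name>" with each extension, in order.
  auto findPathCombination =
      [&](const std::vector<std::string> &extensions) -> std::optional<std::string> {
    for (const std::string &dir : searchPaths)
      for (const std::string &ext : extensions) {
        std::filesystem::path candidate =
            std::filesystem::path(dir) / ("lib" + name + ext);
        if (std::filesystem::exists(candidate))
          return candidate.string();
      }
    return std::nullopt;
  };

  // The lambda factors out the search so the cache assignment appears once.
  auto doSearch = [&]() -> std::optional<std::string> {
    if (searchDylibsFirst) {
      if (auto path = findPathCombination({".tbd", ".dylib"}))
        return path;
      return findPathCombination({".a"});
    }
    return findPathCombination({".tbd", ".dylib", ".a"});
  };

  std::optional<std::string> path = doSearch();
  if (path) // only cache successful lookups, per the inline comment above
    resolvedLibraries[name] = *path;
  return path;
}
```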

int3 added a comment. Nov 2 2021, 7:39 PM

By the way, thanks for contributing all these optimizations! I was quite surprised to hear that ld64 was faster, given that LLD is typically much faster for our own workloads, but I guess you have rather different inputs. Hopefully we can make LLD the fastest Mach-O linker for all builds :)

keith updated this revision to Diff 384326. Nov 2 2021, 9:22 PM
keith marked 2 inline comments as done.

Add cleaning logic

keith added a comment. Nov 2 2021, 9:23 PM

> By the way, thanks for contributing all these optimizations! I was quite surprised to hear that ld64 was faster, given that LLD is typically much faster for our own workloads, but I guess you have rather different inputs. Hopefully we can make LLD the fastest Mach-O linker for all builds :)

Thanks for all the reviews! For context, our project is a huge iOS application with on the order of thousands of static libraries, and many iOS system framework + system library dependencies. Is this a case you've benchmarked? If so, I'd be interested to dig a bit deeper into the differences to try and understand why it has been slower for us.

keith updated this revision to Diff 384332. Nov 2 2021, 9:27 PM

Fix patch

int3 accepted this revision. Nov 3 2021, 7:53 AM

Thanks!

lld/MachO/Driver.cpp
1098

ultra nit: can we make sure resolvedLibraries.clear() is next to resolvedFrameworks.clear()?

This revision is now accepted and ready to land. Nov 3 2021, 7:53 AM
int3 added a comment. Nov 3 2021, 7:57 AM

> For context, our project is a huge iOS application with on the order of thousands of static libraries, and many iOS system framework + system library dependencies. Is this a case you've benchmarked?

I'm not sure how many static libraries our iOS applications have, but I'll get back to you when I've run the numbers. One guess is that you guys might be using a lot more Swift than we do, and swiftc makes extensive use of LC_LINKER_OPTION, which might explain why there are so many flags to parse / paths to search.

keith added a comment. Nov 3 2021, 8:43 AM

> One guess is that you guys might be using a lot more Swift than we do, and swiftc makes extensive use of LC_LINKER_OPTION, which might explain why there are so many flags to parse / paths to search.

Yep, our thousands of libraries are 99.9% Swift.

keith marked 2 inline comments as done. Nov 3 2021, 8:45 AM
keith added inline comments.
lld/MachO/Driver.cpp
1098

Sounds good, I'll do that in the other diff since that one isn't reviewed yet.

> Is this a case you've benchmarked? If so, I'd be interested to dig a bit deeper into the differences to try and understand why it has been slower for us.

For one of our largest iOS apps that I've measured (different from int3's):

  • ~7100 archives
  • 56 frameworks (including system ones)
  • 12 weak frameworks
keith marked an inline comment as done. Nov 3 2021, 9:47 AM

Thanks for the info! What's your bottleneck at this point? After all my changes here, our biggest remaining cost is the 20+ seconds we now spend in lld::macho::readFile.

> What's your bottleneck at this point?

The bottleneck right now is that it crashes while linking ... ;) I'll post an update ~next week.

This revision was automatically updated to reflect the committed changes.
smeenai added a subscriber: smeenai. Nov 3 2021, 1:48 PM

> After all my changes here, our biggest remaining cost is the 20+ seconds we now spend in lld::macho::readFile.

If you can generate a reproducer tarball (--reproduce) and upload it somewhere, that might be interesting. I know LLD COFF does asynchronous file reads because file I/O is generally slower on Windows; maybe that's something that'd help on macOS as well?
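
To illustrate the asynchronous-read idea, here is a minimal sketch of the concept only, not how LLD COFF actually implements it: reads for a known list of inputs are kicked off eagerly and then consumed in command-line order, so link semantics (archive ordering, etc.) are unchanged.

```
// Sketch: overlap input-file reads with std::async; illustration only.
// LLD's real implementation differs (it memory-maps inputs via llvm::MemoryBuffer).
#include <fstream>
#include <future>
#include <iterator>
#include <string>
#include <vector>

static std::string readWholeFile(const std::string &path) {
  std::ifstream in(path, std::ios::binary);
  return std::string(std::istreambuf_iterator<char>(in),
                     std::istreambuf_iterator<char>());
}

std::vector<std::string> prefetchInputs(const std::vector<std::string> &paths) {
  // Start every read up front...
  std::vector<std::future<std::string>> pending;
  pending.reserve(paths.size());
  for (const std::string &p : paths)
    pending.push_back(std::async(std::launch::async, readWholeFile, p));

  // ...then collect the buffers in the original order.
  std::vector<std::string> buffers;
  buffers.reserve(pending.size());
  for (auto &f : pending)
    buffers.push_back(f.get());
  return buffers;
}
```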

keith added a comment. Nov 3 2021, 1:52 PM

It looks like the issue is that some of the system dylibs being loaded should be cached. It appears we currently do about 760k readFile calls, but only ~1500 of them are unique. I'm preparing a patch for discussion that mirrors ld64's behavior of caching these files as they are read.
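
Those numbers (roughly 760k readFile calls for ~1500 unique files) point at a straightforward memoization of file loads keyed by path, similar to what ld64 does; the actual patch under discussion is D113153, linked below. The sketch here is a simplified illustration using std:: containers, not lld's implementation (which works with llvm::MemoryBuffer).

```
// Sketch: deduplicate repeated loads of the same input file by path, so
// repeated requests hit the cache and only unique files touch the disk.
#include <fstream>
#include <iterator>
#include <memory>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the real buffer type.
struct FileBuffer {
  std::string path;
  std::string contents;
};

class FileCache {
public:
  std::shared_ptr<const FileBuffer> get(const std::string &path) {
    auto it = cache.find(path);
    if (it != cache.end())
      return it->second; // cache hit: no disk I/O
    std::ifstream in(path, std::ios::binary);
    auto buf = std::make_shared<FileBuffer>(
        FileBuffer{path, std::string(std::istreambuf_iterator<char>(in),
                                     std::istreambuf_iterator<char>())});
    cache.emplace(path, buf);
    return buf;
  }

private:
  std::unordered_map<std::string, std::shared_ptr<const FileBuffer>> cache;
};
```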

keith added a comment. Nov 3 2021, 5:18 PM

Here's that change for discussion https://reviews.llvm.org/D113153

> What's your bottleneck at this point? After all my changes here, our biggest remaining cost is the 20+ seconds we now spend in lld::macho::readFile.

*With* all of the patches improving input loading applied, linking the app above is only slightly faster than our optimised LD64
(cross-linking on Linux, Broadwell (yeah, kind of old), 32 GB RAM):

x ./LLD_with_cache_patches.txt
+ ./LD64_local_imprv.txt
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|x     *           x  *x  x  * *          +      +   +                         +                                                                +                      +|
|        |_|________A__M______|__________________M____________A____________________________________________________|                                                    |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x  10         23.64         23.89         23.82        23.799   0.077667382
+  10         23.69         25.01         24.04        24.146     0.4379041
Difference at 95.0% confidence
	0.347 +/- 0.295482
	1.45804% +/- 1.24157%
	(Student's t, pooled s = 0.314478)

Looking at the trace, the bottleneck is still in loading input and writing output.
It seems this could be improved further ...

keith added a comment. Nov 9 2021, 9:05 AM

> Looking at the trace, the bottleneck is still in loading input and writing output.
> It seems this could be improved further ...

Thanks for the data point! You might have mentioned this before, but does this mean these changes _were_ an improvement for you as well? Do you have a time for this benchmark from before these changes?

> Thanks for the data point! You might have mentioned this before, but does this mean these changes _were_ an improvement for you as well? Do you have a time for this benchmark from before these changes?

Yes - these changes improved it a bit.

Without these improvements, LLD vs LD64 (our fork):

x LLD_no_cache.txt
+ LD64_local_imprv.txt
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|   +  x                *         +   +    x    x  x +         + x  +  x    x                          +                                                                                          +                               +                        x|
||_____|___________________________________________M___________M____A____________A______________________________________________________|_________________|                                                                                                 |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x  10         23.71         25.16         23.97        24.073    0.40136019
+  10         23.69         25.01         24.04        24.146     0.4379041
No difference proven at 95.0% confidence
keith added a comment. Nov 9 2021, 9:53 AM

> Without these improvements, LLD vs LD64 (our fork):
> No difference proven at 95.0% confidence

That's too bad; I guess our use cases are just different enough. But it's good to know it didn't regress, thanks!

smeenai added subscribers: ruiu, thakis, rnk.

> *With* all of the patches improving input loading applied, linking the app above is only slightly faster than our optimised LD64 (cross-linking on Linux, Broadwell, 32 GB RAM).
>
> Looking at the trace, the bottleneck is still in loading input and writing output.

This is interesting. LLD has been at least 2x faster for us than ld64 (and also a bit faster than zld, which is a public ld64 fork with speed improvements), and I believe it was 3 to 4x faster for @thakis. You have some impressive ld64 improvements :)

It's interesting that you have the large gap between parsing input files and the next trace. I haven't seen that on our end yet.

I spent some time looking at the performance for loading inputs. Most of the time was going into parsing symbols and relocations, as expected.

Doing symbol resolution in parallel seems pretty tricky because of archive loading semantics and weak symbols. A while back, @ruiu had the idea to at least parallelize all the symbol table insertions (https://lists.llvm.org/pipermail/llvm-dev/2019-April/131902.html), but I don't think he prototyped it (or at least I can't find the results from the prototype anywhere). More recently, @int3 was looking into parallelizing bits of LLD (https://lists.llvm.org/pipermail/llvm-dev/2021-April/149651.html), and there was some interesting discussion there (particularly Jez's follow-up in https://lists.llvm.org/pipermail/llvm-dev/2021-April/149675.html, and @rnk's parallel symbol resolution idea in https://lists.llvm.org/pipermail/llvm-dev/2021-April/149689.html).

For relocation processing specifically, the expensive parts on our end were (a) fetching embedded addends from the instruction stream (which results in lots of cache misses), and (b) for relocations pointing to a section (i.e. where the r_extern bit isn't set), finding the subsection to associate them with. (a) seems kinda unavoidable, and for (b), I tried packing the offsets tightly into a packed array instead of having them spread out across a struct, which didn't help, and doing a linear search instead of a binary search, which was much worse. I also had the idea of parallelizing relocation parsing across all InputFiles (it's after symbol resolution, so it should be trivially parallelizable), but that also didn't make any difference, surprisingly (I only tried it with one link, and I haven't looked into why). (A WIP sketch of the parallel relocation processing which won't work for LTO is in D113542, for reference.)

(Update: I think I might have been measuring parallel relocation processing wrong; I'm seeing a 20% win from it now. I'll need to test it more though.)
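
For reference, the shape of the parallel relocation-parsing idea mentioned above (the WIP is in D113542) is roughly as follows: once symbol resolution is done, each input file's relocations can be decoded independently. This is a sketch, assuming llvm/Support/Parallel.h's parallelForEach is available; InputFile, Reloc, and parseRelocations are placeholders rather than lld's real types.

```
// Sketch: fan relocation parsing out across input files after symbol
// resolution, when the files no longer depend on each other.
#include "llvm/Support/Parallel.h"
#include <cstdint>
#include <vector>

struct Reloc {
  uint32_t offset;
  uint32_t targetIndex;
};

struct InputFile {
  std::vector<uint8_t> sectionData;
  std::vector<Reloc> relocs; // filled in by parseRelocations()
};

static void parseRelocations(InputFile &file) {
  // Placeholder: decode relocation records and embedded addends for `file`.
  // Each file only writes into its own `relocs`, so no cross-file
  // synchronization is needed.
  (void)file;
}

void parseAllRelocations(std::vector<InputFile> &files) {
  llvm::parallelForEach(files, [](InputFile &file) { parseRelocations(file); });
}
```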

> It's interesting that you have the large gap between parsing input files and the next trace. I haven't seen that on our end yet.

This is likely because for this particular app (actually also quite true for most of our apps) the input set is huge - it seems LLD spends most of its time blocking on disk reads.

By contrast, ld64 parses all input files up front in parallel. (Same for archives - it loads all the members it knows will be needed, thanks to -ObjC/-all_load/-force_load.)

(At least, some good news: as shown below, LLD's user CPU time is lower than LD64's ;) - so it's efficient, just not as parallel as it could be.)

SYSTEM CPU time: 
x ./LD64_res2.txt
+ ./LLD_released_res2.txt
    N           Min           Max        Median           Avg        Stddev
x   5          1.84          2.01           1.9         1.916   0.065802736
+   5          3.28          3.36          3.29         3.312   0.035637059
Difference at 95.0% confidence
    1.396 +/- 0.0771735
    72.8601% +/- 4.02785%
    (Student's t, pooled s = 0.052915)
USER CPU time: 
x ./LD64_res2.txt
+ ./LLD_released_res2.txt
    N           Min           Max        Median           Avg        Stddev
x   5         12.04         12.28         12.06        12.124    0.10644247
+   5          5.51          5.66          5.54         5.562   0.060991803
Difference at 95.0% confidence
    -6.562 +/- 0.126515
    -54.1241% +/- 1.04351%
    (Student's t, pooled s = 0.0867468)
WALL time: 
x ./LD64_res2.txt
+ ./LLD_released_res2.txt
    N           Min           Max        Median           Avg        Stddev
x   5         12.42         12.59         12.54        12.516   0.064265076
+   5         15.01         15.26         15.21        15.184    0.10139033
Difference at 95.0% confidence
    2.668 +/- 0.123796
    21.3167% +/- 0.989101%
    (Student's t, pooled s = 0.0848823)

> I spent some time looking at the performance for loading inputs. Most of the time was going into parsing symbols and relocations, as expected.

Do you have stats on how much time is spent in symbol parsing relative to reloc handling?
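
One way to get that kind of breakdown is LLVM's time profiler: wrapping the phases of interest in llvm::TimeTraceScope markers makes them show up as separate spans when the linker is run with --time-trace (where the build supports that flag). The sketch below is hypothetical instrumentation, not code from lld; InputFile, parseSymbols, and parseRelocations are placeholder names.

```
// Hypothetical instrumentation sketch: split symbol parsing and relocation
// handling into separate spans in the --time-trace output.
#include "llvm/Support/TimeProfiler.h"
#include <string>

struct InputFile { std::string name; };
static void parseSymbols(InputFile &) { /* real symbol parsing elided */ }
static void parseRelocations(InputFile &) { /* real relocation parsing elided */ }

void parseInputFile(InputFile &file) {
  {
    // Each scope becomes its own entry in the Chrome trace, labelled with
    // the file name, so the two costs can be compared per file.
    llvm::TimeTraceScope timeScope("Parse symbols", file.name);
    parseSymbols(file);
  }
  {
    llvm::TimeTraceScope timeScope("Parse relocations", file.name);
    parseRelocations(file);
  }
}
```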