On top of https://reviews.llvm.org/D113063 this took another 10 seconds
off our overall link time.
Details
- Reviewers: gkm, int3
- Group Reviewers: Restricted Project
- Commits: rGf79e65e61faf: [lld-macho] Cache library paths from findLibrary

Diff Detail
- Repository: rG LLVM Github Monorepo

Event Timeline
| lld/MachO/Driver.cpp | |
|---|---|
| 85–106 | what value does the lambda add here? |
| lld/MachO/Driver.cpp | |
|---|---|
| 85–106 | IMO the reduced duplication is a big benefit |
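For readers following along, here is a rough, self-contained sketch of the caching-plus-lambda pattern under discussion. The std:: types and the findPathCombination stub are simplified placeholders, not LLD's actual Driver.cpp; only the shape of the change is meant to match the patch.

```cpp
// Hypothetical, simplified sketch -- not LLD's actual driver code.
#include <filesystem>
#include <map>
#include <optional>
#include <string>
#include <vector>

namespace fs = std::filesystem;

// Stub: try every searchPath/extension combination, return the first hit.
static std::optional<std::string>
findPathCombination(const std::string &stem,
                    const std::vector<std::string> &roots,
                    const std::vector<std::string> &extensions) {
  for (const std::string &root : roots)
    for (const std::string &ext : extensions) {
      fs::path candidate = fs::path(root) / (stem + ext);
      if (fs::exists(candidate))
        return candidate.string();
    }
  return std::nullopt;
}

// Cache of library name -> resolved path (nullopt means "not found").
static std::map<std::string, std::optional<std::string>> resolvedLibraries;

std::optional<std::string>
findLibrary(const std::string &name,
            const std::vector<std::string> &librarySearchPaths,
            bool searchDylibsFirst) {
  auto cached = resolvedLibraries.find(name);
  if (cached != resolvedLibraries.end())
    return cached->second; // Hit: repeated -lFoo flags resolve instantly.

  // The lambda lets both search orders share a single "compute, then
  // cache" step instead of duplicating it on every return path.
  auto doFind = [&]() -> std::optional<std::string> {
    if (searchDylibsFirst) {
      if (auto path = findPathCombination("lib" + name, librarySearchPaths,
                                          {".tbd", ".dylib"}))
        return path;
      return findPathCombination("lib" + name, librarySearchPaths, {".a"});
    }
    return findPathCombination("lib" + name, librarySearchPaths,
                               {".tbd", ".dylib", ".a"});
  };

  std::optional<std::string> path = doFind();
  resolvedLibraries[name] = path; // Negative results are cached too.
  return path;
}
```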
By the way, thanks for contributing all these optimizations! I was quite surprised to hear that ld64 was faster, given that LLD is typically much faster for our own workloads, but I guess you have rather different inputs. Hopefully we can make LLD the fastest Mach-O linker for all builds :)
Thanks for all the reviews! For context, our project is a huge iOS application with on the order of thousands of static libraries, plus many iOS system framework and system library dependencies. Is this a case you've benchmarked? If so, I'd be interested to dig a bit deeper into the differences to try to understand why it has been slower for us.
Thanks!
| lld/MachO/Driver.cpp | |
|---|---|
| 1098 | ultra nit: can we make sure resolvedLibraries.clear() is next to resolvedFrameworks.clear()? |
For context our project is a huge iOS application with on the order of thousands of static libraries, and many iOS system framework + system library dependencies. Is this a case you've benchmarked?
I'm not sure how many static libraries our iOS applications have, but I'll get back to you when I've run the numbers. One guess is that you guys might be using a lot more Swift than we do, and swiftc makes extensive use of LC_LINKER_OPTION, which might explain why there are so many flags to parse / paths to search.
| lld/MachO/Driver.cpp | |
|---|---|
| 1098 | Sounds good, I'll do that in the other diff since that one isn't reviewed yet |
For one of our largest iOS apps that I've measured (different from int3's):
- ~7100 archives
- 56 frameworks (including system ones)
- 12 weak frameworks
Thanks for the info! What's your bottleneck at this point? After all my changes here, our biggest remaining cost is that we now spend 20+ seconds in lld::macho::readFile.
The bottleneck right now is that it crashes in linking ... ;) I'll post an update ~ next week.
If you can generate a reproducer tarball (--reproduce) and upload it somewhere, that might be interesting. I know LLD COFF does asynchronous file reads because of file I/O generally being slower on Windows; maybe that's something that'd help on macOS as well?
It looks like the issue is that some of the system dylibs being loaded should be cached. It appears we currently do about 760k readFile calls, but there are only 1500 unique files. I'm preparing a patch for discussion that mirrors ld64's behavior of caching these as they are read.
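To illustrate the kind of change being described (this is not the actual patch), a minimal memoization shell might look like the following; DylibFile and parseDylib are hypothetical stand-ins for whatever does the real reading and parsing.

```cpp
// Hedged sketch: cache parsed dylibs by path so that re-exported or
// repeatedly referenced system libraries are read and parsed only once.
#include <iostream>
#include <map>
#include <memory>
#include <string>

struct DylibFile {
  std::string path; // plus exported symbols, re-exports, etc.
};

// Placeholder for the expensive read-and-parse step.
static std::unique_ptr<DylibFile> parseDylib(const std::string &path) {
  std::cout << "parsing " << path << "\n";
  return std::make_unique<DylibFile>(DylibFile{path});
}

DylibFile *loadDylib(const std::string &path) {
  static std::map<std::string, std::unique_ptr<DylibFile>> loadedDylibs;
  auto [it, inserted] = loadedDylibs.try_emplace(path);
  if (inserted)
    it->second = parseDylib(path); // Only the first load per path parses.
  return it->second.get();         // Subsequent loads are cache hits.
}

int main() {
  loadDylib("/usr/lib/libSystem.B.dylib");
  loadDylib("/usr/lib/libSystem.B.dylib"); // No second "parsing" line.
}
```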
*With* all of the patches improving input loading applied, linking the app above is only slightly faster than with our optimised LD64:
(cross-linking on Linux, Broadwell (yeah, kind of old), 32 GB RAM)
x ./LLD_with_cache_patches.txt
+ ./LD64_local_imprv.txt
(ministat histogram omitted)
    N           Min           Max        Median           Avg        Stddev
x  10         23.64         23.89         23.82        23.799   0.077667382
+  10         23.69         25.01         24.04        24.146     0.4379041
Difference at 95.0% confidence
        0.347 +/- 0.295482
        1.45804% +/- 1.24157%
        (Student's t, pooled s = 0.314478)
Looking at the trace, the bottleneck is still in loading input and writing output.
It seems it could be improved further ...
Thanks for the data point! You might have mentioned this before, but does this mean these changes _were_ an improvement for you as well? Do you have a time for this benchmark from before these changes?
Yes - these changes improved it a bit.
Without these improvements, LLD vs LD64 (our fork):
x LLD_no_cache.txt
+ LD64_local_imprv.txt
(ministat histogram omitted)
    N           Min           Max        Median           Avg        Stddev
x  10         23.71         25.16         23.97        24.073    0.40136019
+  10         23.69         25.01         24.04        24.146     0.4379041
No difference proven at 95.0% confidence
That's too bad, I guess our use cases are just different enough. But good to know it didn't regress, thanks!
This is interesting. LLD has been at least 2x faster for us than ld64 (and also a bit faster than zld, which is a public ld64 fork with speed improvements), and I believe it was 3 to 4x faster for @thakis. You have some impressive ld64 improvements :)
It's interesting that you have the large gap between parsing input files and the next trace. I haven't seen that on our end yet.
I spent some time looking at the performance for loading inputs. Most of the time was going into parsing symbols and relocations, as expected.
Doing symbol resolution in parallel seems pretty tricky because of archive loading semantics and weak symbols. A while back, @ruiu had the idea to at least parallelize all the symbol table insertions (https://lists.llvm.org/pipermail/llvm-dev/2019-April/131902.html), but I don't think he prototyped it (or at least I can't find the results from the prototype anywhere). More recently, @int3 was looking into parallelizing bits of LLD (https://lists.llvm.org/pipermail/llvm-dev/2021-April/149651.html), and there was some interesting discussion there (particularly Jez's follow-up in https://lists.llvm.org/pipermail/llvm-dev/2021-April/149675.html, and @rnk's parallel symbol resolution idea in https://lists.llvm.org/pipermail/llvm-dev/2021-April/149689.html).
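As a rough illustration of the "parallelize the symbol table insertions" idea from those threads (not a real LLD implementation), one common shape is to shard the table by name hash so that concurrent inserts rarely contend:

```cpp
// Hypothetical sketch only: actual resolution rules (weak vs. strong
// definitions, lazy archive members, etc.) would layer on top of this.
#include <array>
#include <functional>
#include <mutex>
#include <string>
#include <unordered_map>

struct Symbol {
  std::string name; // plus definition kind, file, weak flag, ...
};

class ShardedSymbolTable {
  static constexpr size_t kNumShards = 16;
  struct Shard {
    std::mutex mutex;
    std::unordered_map<std::string, Symbol> map;
  };
  std::array<Shard, kNumShards> shards;

public:
  // Thread-safe insert-or-lookup: threads touching different shards
  // never block each other.
  Symbol *insert(const std::string &name) {
    Shard &shard = shards[std::hash<std::string>{}(name) % kNumShards];
    std::lock_guard<std::mutex> lock(shard.mutex);
    return &shard.map.try_emplace(name, Symbol{name}).first->second;
  }
};
```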
For relocation processing specifically, the expensive parts on our end were (a) fetching embedded addends from the instruction stream (which results in lots of cache misses), and (b) for relocations pointing to a section (i.e. where the r_extern bit isn't set), finding the subsection to associate them with. (a) seems kinda unavoidable, and for (b), I tried packing the offsets tightly into a packed array instead of having them spread out across a struct, which didn't help, and doing a linear search instead of a binary search, which was much worse. I also had the idea of parallelizing relocation parsing across all InputFiles (it's after symbol resolution, so it should be trivially parallelizable), but that also didn't make any difference, surprisingly (I only tried it with one link, and I haven't looked into why). (A WIP sketch of the parallel relocation processing which won't work for LTO is in D113542, for reference.)
(Update: I think I might have been measuring parallel relocation processing wrong; I'm seeing a 20% win from it now. I'll need to test it more though.)
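Since these measurements come up a lot, here is a standalone sketch of the per-file parallel relocation pass being described, including the binary-search subsection lookup for non-extern relocations. LLD itself would use llvm::parallelForEach and its own InputFile types; everything below (InputFile, Reloc, subsectionOffsets) is a simplified stand-in.

```cpp
// Hedged sketch: parallel relocation parsing across independent input files.
#include <algorithm>
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

struct Reloc {
  uint64_t offset; // section-relative address the relocation applies to
  bool isExtern;   // r_extern: refers to a symbol rather than a section
};

struct InputFile {
  std::vector<Reloc> relocs;
  std::vector<uint64_t> subsectionOffsets; // sorted subsection start offsets
};

static void parseRelocations(InputFile &file) {
  for (const Reloc &r : file.relocs) {
    if (!r.isExtern && !file.subsectionOffsets.empty()) {
      // (b): associate a section-relative relocation with its subsection by
      // binary-searching the sorted start offsets.
      auto it = std::upper_bound(file.subsectionOffsets.begin(),
                                 file.subsectionOffsets.end(), r.offset);
      size_t subsectionIndex =
          it == file.subsectionOffsets.begin()
              ? 0
              : static_cast<size_t>(it - file.subsectionOffsets.begin() - 1);
      (void)subsectionIndex;
    }
    // (a): fetching the embedded addend from the instruction stream would
    // happen here; that's the cache-miss-heavy part mentioned above.
  }
}

// Relocation parsing happens after symbol resolution, so each file can be
// processed independently by a pool of worker threads.
void parseAllRelocations(std::vector<InputFile> &files) {
  unsigned numThreads = std::max(1u, std::thread::hardware_concurrency());
  std::atomic<size_t> next{0};
  std::vector<std::thread> workers;
  for (unsigned t = 0; t < numThreads; ++t)
    workers.emplace_back([&] {
      for (size_t i = next++; i < files.size(); i = next++)
        parseRelocations(files[i]);
    });
  for (std::thread &w : workers)
    w.join();
}
```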
It's interesting that you have the large gap between parsing input files and the next trace. I haven't seen that on our end yet.
This is likely because, for this particular app (actually also quite true for most of our apps), the input set is huge - it seems LLD spends most of its time blocked on disk reads.
By contrast, ld64 parses all input files up front in parallel (same for archives - it loads all members that it knows will be needed thanks to -ObjC/-all_load/-force_load).
(At least, the good news is that, as shown below, LLD's CPU time is lower than LD64's ;) - so it's efficient, just not as parallel as it could be.)
SYSTEM CPU time:
x ./LD64_res2.txt
+ ./LLD_released_res2.txt
    N           Min           Max        Median           Avg        Stddev
x   5          1.84          2.01           1.9         1.916   0.065802736
+   5          3.28          3.36          3.29         3.312   0.035637059
Difference at 95.0% confidence
        1.396 +/- 0.0771735
        72.8601% +/- 4.02785%
        (Student's t, pooled s = 0.052915)

USER CPU time:
x ./LD64_res2.txt
+ ./LLD_released_res2.txt
    N           Min           Max        Median           Avg        Stddev
x   5         12.04         12.28         12.06        12.124    0.10644247
+   5          5.51          5.66          5.54         5.562   0.060991803
Difference at 95.0% confidence
        -6.562 +/- 0.126515
        -54.1241% +/- 1.04351%
        (Student's t, pooled s = 0.0867468)

WALL time:
x ./LD64_res2.txt
+ ./LLD_released_res2.txt
    N           Min           Max        Median           Avg        Stddev
x   5         12.42         12.59         12.54        12.516   0.064265076
+   5         15.01         15.26         15.21        15.184    0.10139033
Difference at 95.0% confidence
        2.668 +/- 0.123796
        21.3167% +/- 0.989101%
        (Student's t, pooled s = 0.0848823)
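For comparison with the up-front parallel loading described above, a bare-bones way to overlap the disk reads might look like this; the function names and the Buffer type are made up for this sketch, and LLD and ld64 each have their own I/O machinery.

```cpp
// Hedged sketch: kick off all input-file reads at once so parsing never
// waits on a cold, serialized disk read.
#include <fstream>
#include <future>
#include <iterator>
#include <string>
#include <vector>

using Buffer = std::vector<char>;

static Buffer readWholeFile(const std::string &path) {
  std::ifstream in(path, std::ios::binary);
  return Buffer(std::istreambuf_iterator<char>(in),
                std::istreambuf_iterator<char>());
}

// Start every read immediately; a caller blocks only when it actually needs
// a particular file's bytes (future::get).
std::vector<std::future<Buffer>>
prefetchInputs(const std::vector<std::string> &paths) {
  std::vector<std::future<Buffer>> pending;
  pending.reserve(paths.size());
  for (const std::string &p : paths)
    pending.push_back(std::async(std::launch::async, readWholeFile, p));
  return pending;
}
```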
I spent some time looking at the performance for loading inputs. Most of the time was going into parsing symbols and relocations, as expected.
Do you have stats on how much time is spent in parsing relative to reloc handling?
Ditto for cleanupCallback (as mentioned in D113063).