This is an archive of the discontinued LLVM Phabricator instance.

[COFF] Add a fastpath for /INCLUDE: in .drective sections
ClosedPublic

Authored by rnk on Apr 24 2020, 5:44 PM.

Details

Summary

This speeds up linking chrome.dll with PGO instrumentation by 13%
(154271ms -> 134033ms).

LLVM's Option library is very slow. In particular, it allocates at least
one large-ish heap object (Arg) for every argument. When PGO
instrumentation is enabled, all the __profd_* symbols are added to the
@llvm.used list, which compiles down to these /INCLUDE: directives. This
means we have O(#symbols) directives to parse in the section, so we end
up allocating an Arg for every function symbol in the object file. This
is unnecessary.

To address the issue and speed up the link, extend the fast path that we
already have for /EXPORT:, which has similar scaling issues.
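The kind of fast path described here can be sketched in standalone C++. This is an illustrative parser, not LLD's actual code (the real one lives in the COFF driver, handles case-insensitivity, and uses LLVM's string types); the names `fastPathParse` and `ParsedDirectives` are invented for the example. Tokens that do not match a fast-pathed prefix are collected so they can be handed to the general option parser later.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical sketch: scan a .drective-style string and peel off
// "/INCLUDE:sym" and "/EXPORT:sym" tokens directly, without allocating
// a general-purpose Arg object per token.
struct ParsedDirectives {
  std::vector<std::string> includes;
  std::vector<std::string> exports;
  std::string rest; // everything we did not fast-path
};

ParsedDirectives fastPathParse(const std::string &s) {
  ParsedDirectives out;
  size_t i = 0;
  while (i < s.size()) {
    while (i < s.size() && s[i] == ' ') ++i; // skip separators
    size_t end = s.find(' ', i);
    if (end == std::string::npos) end = s.size();
    std::string tok = s.substr(i, end - i);
    if (tok.rfind("/INCLUDE:", 0) == 0)
      out.includes.push_back(tok.substr(9));
    else if (tok.rfind("/EXPORT:", 0) == 0)
      out.exports.push_back(tok.substr(8));
    else if (!tok.empty()) {
      if (!out.rest.empty()) out.rest += ' ';
      out.rest += tok; // fall back to the slow path for these
    }
    i = end;
  }
  return out;
}
```

With O(#symbols) `/INCLUDE:` directives per object file, this turns a heap allocation per symbol into a single pass over the string.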

I promise that I took a hard look at optimizing the Option library, but
its data structures are very general and would need a lot of cleanup. We
have accumulated lots of optional features (option groups, aliases,
multiple values) over the years, and these are now properties of every
parsed argument, when the vast majority of arguments do not use these
features.

Diff Detail

Event Timeline

rnk created this revision.Apr 24 2020, 5:44 PM
Herald added a project: Restricted Project. Apr 24 2020, 5:44 PM

thakis accepted this revision.Apr 24 2020, 7:12 PM

Nice!

This revision is now accepted and ready to land.Apr 24 2020, 7:12 PM

I promise that I took a hard look at optimizing the Option library, ... multiple values ... features.

Sigh, the only user of multiple values: https://reviews.llvm.org/D62070#2000628

aganea added a comment. (Edited) Apr 25 2020, 9:46 AM

As for LLVMOptions, what prevents using a BumpAllocator + placement new for the Arg(s)? Or is the time lost somewhere else?
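For readers unfamiliar with the technique aganea is proposing, here is a standalone sketch of the bump-allocation idea: carve many small objects out of a few large slabs instead of paying one heap allocation per object. LLVM's `llvm::BumpPtrAllocator` works on this principle, but the class below is a hand-rolled stand-in, not its real implementation, and `Arg` here is a dummy struct, not `llvm::opt::Arg`.

```cpp
#include <cstddef>
#include <new>
#include <vector>

// Minimal bump allocator: hands out chunks of a 4 KiB slab and only
// touches the heap when a slab fills up. Objects placed here must be
// trivially destructible (no destructors are ever run).
class BumpAlloc {
  static constexpr size_t SlabSize = 4096;
  std::vector<char *> slabs;
  size_t used = SlabSize; // forces a slab on the first allocate()
public:
  void *allocate(size_t size, size_t align) {
    size_t start = (used + align - 1) & ~(align - 1); // align must be a power of two
    if (start + size > SlabSize) {
      slabs.push_back(new char[SlabSize]);
      used = 0;
      start = 0;
    }
    used = start + size;
    return slabs.back() + start;
  }
  size_t numSlabs() const { return slabs.size(); }
  ~BumpAlloc() { for (char *s : slabs) delete[] s; }
};

struct Arg { int index; const char *spelling; }; // stand-in for opt::Arg

Arg *makeArg(BumpAlloc &a, int index, const char *spelling) {
  return new (a.allocate(sizeof(Arg), alignof(Arg))) Arg{index, spelling};
}
```

The win is exactly the pattern in this patch: thousands of short-lived Arg-sized objects become pointer bumps within a slab.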

Side-note: I was profiling the build-time regression between Clang 9, 10 & 11 on building LLVM and building a few of our games, with and without debug info. There's a severe regression from Clang 9 to 10: +10% CPU without debug info, and +15% to +18% with debug info. Clang 11 adds an extra +2%. Additionally, no matter from what angle I look, allocations in clang take 10.5%-11% of the total CPU time (on Windows with the standard heap allocator). Replacing the allocator reduces that to 3.5% CPU for allocations, and improves some bandwidth/cache-sensitive functions along the way, which effectively reduces CPU usage by 15% (compared to baseline Clang 10). But at the same time it sweeps the issue under the carpet. This all seems related to the number of (small) allocations in LLVM generally.

lld/COFF/Driver.h
51

Curious: how many includes do you have per .drective? Is it worth calling .reserve() somehow before inserting? MS-STL grows the std::vector buffer geometrically. If your .drective has many tokens, we would probably allocate & move memory several times per .drective. I think includes.reserve(tokensNum) and exports.reserve(tokensNum) would be cheaper than the cost of re-allocation, even if we waste a little extra memory. Unless you do a two-step parsing.
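The cost aganea is pointing at can be demonstrated with a small experiment that counts capacity changes (each one implies a reallocation plus a move of all existing elements). This is an illustrative measurement harness, not code from the patch:

```cpp
#include <string>
#include <vector>

// Count how many times a vector reallocates while receiving n elements,
// with and without an up-front reserve(). Growth is geometric (2x in
// libstdc++, 1.5x in MS-STL), so without reserve() we expect O(log n)
// reallocations, each moving the whole buffer.
size_t countReallocs(size_t n, bool reserveUpFront) {
  std::vector<std::string> v;
  if (reserveUpFront) v.reserve(n);
  size_t reallocs = 0, cap = v.capacity();
  for (size_t i = 0; i < n; ++i) {
    v.push_back("sym");
    if (v.capacity() != cap) { ++reallocs; cap = v.capacity(); }
  }
  return reallocs;
}
```

For a .drective section with thousands of `/INCLUDE:` tokens, a single `reserve()` trades a little slack memory for zero growth reallocations.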

lld/COFF/DriverUtils.cpp
869

tokenize() allocates, i.e. the returned std::vector doesn't .reserve() when constructed from a range (at least in the MS STL shipped with VS 2019). That would be a good candidate for further optimization, especially since you don't need the 'saver'.

rnk added a comment.Apr 28 2020, 10:11 AM

As for LLVMOptions, what prevents a BumpAllocator + placement new on the Arg(s)? Or is the perf. wasted somewhere else?

I think that's a good first step. After that, I would focus on making the object leaner. It is huge.

The Option class is two pointers in size, and we pass it by value. It needs a pointer to the parent option table so that it can implement getAliasedOption and getOptionGroup (sp). I think we could trim out the parent pointer by assuming that the option Info structs are laid out in an array indexed by option ID: subtract the current option's ID from its Info pointer to recover the table base, then add back the ID of the option you want.
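The pointer trick rnk describes can be shown in a few lines. The `Info` struct and function names below are hypothetical stand-ins, not the real `llvm::opt` layout; the sketch only assumes `infos[i].id == i`, which is the invariant the suggestion relies on:

```cpp
#include <cstddef>

// Per-option record, assumed to live in one array indexed by option ID.
struct Info {
  unsigned id;
  unsigned aliasId; // 0 = no alias (so option 0 cannot be an alias target)
};

// Recover the table base from any record: its pointer minus its own ID.
// This is what lets Option drop a separate parent-table pointer.
const Info *tableBase(const Info *self) {
  return self - self->id; // valid only if infos[i].id == i
}

// Alias lookup without a stored table pointer.
const Info *aliasedOption(const Info *self) {
  if (self->aliasId == 0) return nullptr;
  return tableBase(self) + self->aliasId;
}
```

The trade-off is an invariant the table generator must enforce (IDs dense and in array order) in exchange for halving the size of every by-value Option.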

For Arg itself, I would suggest making all seldom-used fields trailing objects. This complicates the use of BumpPtrAllocator, but it seems worth it.

Side-note: I was profiling the build-time regression between Clang 9, 10 & 11 on building LLVM and building a few of our games, with and without debug info. There's a severe regression from Clang 9 to 10: +10% CPU without debug info, and +15% to +18% with debug info. Clang 11 adds an extra +2%. Additionally, no matter from what angle I look, allocations in clang take 10.5%-11% of the total CPU time (on Windows with the standard heap allocator). Replacing the allocator reduces that to 3.5% CPU for allocations, and improves some bandwidth/cache-sensitive functions along the way, which effectively reduces CPU usage by 15% (compared to baseline Clang 10). But at the same time it sweeps the issue under the carpet. This all seems related to the number of (small) allocations in LLVM generally.

I think the Google C++ production toolchain team noticed similar results and switched to tcmalloc to achieve the same thing.

I think early in LLVM project history, developers did a lot of micro-optimization focusing on reducing heap allocations (see prevalence (and overuse!) of SmallVector), and a lot of that has gone by the wayside as generic containers proliferate in new code.

However, I know that for the option library specifically, performance was not a priority because it was considered to only impact startup time, and therefore not worth optimizing. Reusing it for .drective parsing where throughput is important takes it outside of the original problem domain.


As a next step, I noticed that cl::TokenizeWindowsCommandLine copies all strings. ;_; When I initially wrote it, I had intended that it would only copy an argument in the case that it had to deal with quotations, but it looks like some developer has "helpfully" fixed a use after free by making it always copy. :(

This revision was automatically updated to reflect the committed changes.
rnk added a comment.Apr 28 2020, 11:00 AM
In D78845#2008164, @rnk wrote:

As a next step, I noticed that cl::TokenizeWindowsCommandLine copies all strings. ;_; When I initially wrote it, I had intended that it would only copy an argument in the case that it had to deal with quotations, but it looks like some developer has "helpfully" fixed a use after free by making it always copy. :(

Nevermind, ignore that part of the comment. I just imagined that I had implemented this optimization. :) I wasn't able to implement this optimization because the Option library really wants to work with null-terminated strings, and we have to copy to get that null termination.

To avoid copying all strings during tokenization, we would need to start by relaxing that null terminated string requirement in the Option library. The string copies should already use BumpPtrAllocator, since they use StringSaver. This optimization should meaningfully reduce memory usage for PGO instrumented builds, because these strings currently live forever, and that is unnecessary.
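The constraint rnk describes is easy to demonstrate: a token sliced out of a larger directive buffer is not null-terminated at the token boundary, so any consumer that insists on C strings forces a copy. In LLVM the copy goes through `StringSaver` (backed by a `BumpPtrAllocator`); the `Saver` below is a simplified stand-in that keeps copies alive in a deque, purely for illustration:

```cpp
#include <cstring>
#include <deque>
#include <string>

// Stand-in for LLVM's StringSaver: copies a [begin, begin+len) slice
// into owned storage, which appends the '\0' the Option library needs.
struct Saver {
  std::deque<std::string> storage; // deque: stable addresses across growth
  const char *save(const char *begin, size_t len) {
    storage.emplace_back(begin, len);
    return storage.back().c_str();
  }
};
```

Relaxing the null-termination requirement (e.g. accepting length-delimited string views) would let tokens alias the original section data and skip these copies entirely.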

In D78845#2008164, @rnk wrote:

I think early in LLVM project history, developers did a lot of micro-optimization focusing on reducing heap allocations (see prevalence (and overuse!) of SmallVector), and a lot of that has gone by the wayside as generic containers proliferate in new code.

I think the main issue is that there's no easy way to "see" how allocations scale across LLVM in general: how many there are, where they happen, how many per second, a diff against a previous build, etc.
We could implement a system to "tag" each allocation and save it along with the call stack. It would then be easy to save binary snapshots, inspect where the allocations go, do diffing, etc.
It just takes someone who has time to do it ;-)