Page MenuHomePhabricator

[InstrProfiling] Generate runtime hook for ELF platforms
AcceptedPublic

Authored by phosek on Mar 5 2021, 10:49 AM.

Details

Summary

When using -fprofile-list to selectively apply instrumentation only
to certain files or functions, we may end up with a binary that doesn't
have any counters in the case where no files were selected. However,
because on Linux and Fuchsia, we pass -ullvm_profile_runtime, the
runtime would still be pulled in and incur some non-trivial overhead,
especially in the case when the continuous or runtime counter relocation
mode is being used. A better way would be to pull in the profile runtime
only when needed by declaring the
llvm_profile_runtime symbol in the
translation unit only when needed.

This approach was already used prior to 9a041a75221ca, but we changed it
to always generate the llvm_profile_runtime due to a TAPI limitation.
Since TAPI is only used on Mach-O platforms, we could use the early
emission of
llvm_profile_runtime there, and on other platforms we
could change back to the earlier approach where the symbol is generated
later only when needed. We can stop passing -u__llvm_profile_runtime to
the linker on Linux and Fuchsia since the generated undefined symbol in
each translation unit that needed it serves the same purpose.

Diff Detail

Event Timeline

phosek created this revision.Mar 5 2021, 10:49 AM
phosek requested review of this revision.Mar 5 2021, 10:49 AM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptMar 5 2021, 10:49 AM
phosek added inline comments.Mar 5 2021, 10:50 AM
llvm/lib/Transforms/Instrumentation/InstrProfiling.cpp
1082–1084

@vsk do you know why we need this function instead of just using llvm.compiler.used/llvm.used for the symbol? I used that approach for ELF and it seems to be working fine.

phosek added a comment.Mar 9 2021, 2:46 PM

@vsk do you have any thoughts on this?

vsk added a subscriber: ributzka.Mar 9 2021, 2:57 PM

@ributzka may have stronger thoughts about when -fprofile-instr-generate must imply that a known set of symbols appear with external visibility. Up until now, the answer has been "always", and this is what tapi enforces for MachO. It's awkward to have this be inconsistent between MachO/ELF, but if there's a compelling performance reason then I think it's fine.

llvm/lib/Transforms/Instrumentation/InstrProfiling.cpp
1082–1084

I don't have the context for this, since this code is from before I started working on llvm. I'm guessing, but maybe it's possible that llvm(.compiler)?.used didn't exist or work well when this code was written.

phosek marked an inline comment as done.Mar 9 2021, 3:07 PM
In D98061#2615239, @vsk wrote:

@ributzka may have stronger thoughts about when -fprofile-instr-generate must imply that a known set of symbols appear with external visibility. Up until now, the answer has been "always", and this is what tapi enforces for MachO. It's awkward to have this be inconsistent between MachO/ELF, but if there's a compelling performance reason then I think it's fine.

From the perspective of Fuchsia, we've seen a noticeable impact of this change when using -fprofile-instr-generate together -fprofile-list to apply instrumentation selectively only to modified files.

llvm/lib/Transforms/Instrumentation/InstrProfiling.cpp
1082–1084

Would it be OK with you if I sent out a separate change to remove this?

vsk added a comment.Mar 9 2021, 3:28 PM
In D98061#2615239, @vsk wrote:

@ributzka may have stronger thoughts about when -fprofile-instr-generate must imply that a known set of symbols appear with external visibility. Up until now, the answer has been "always", and this is what tapi enforces for MachO. It's awkward to have this be inconsistent between MachO/ELF, but if there's a compelling performance reason then I think it's fine.

From the perspective of Fuchsia, we've seen a noticeable impact of this change when using -fprofile-instr-generate together -fprofile-list to apply instrumentation selectively only to modified files.

What kind of impact do you see? If #counters > 0, is it mostly binary size cost? If #counters == 0, what's the avg. overhead of writing out the full profile? Can it be fixed by doing an early-exit in the runtime initializer, writing out an empty .profraw?

That raises a question about tooling support: some workflows (like the Xcode coverage plugin) might assume that a program compiled with -fprofile-instr-generate always creates a .profraw. If there's no profile written at all for the #counters == 0 case, that could be a breaking change.

llvm/lib/Transforms/Instrumentation/InstrProfiling.cpp
1082–1084

Thanks, yes, that would be great.

phosek marked an inline comment as done.Mar 9 2021, 3:56 PM
In D98061#2615334, @vsk wrote:
In D98061#2615239, @vsk wrote:

@ributzka may have stronger thoughts about when -fprofile-instr-generate must imply that a known set of symbols appear with external visibility. Up until now, the answer has been "always", and this is what tapi enforces for MachO. It's awkward to have this be inconsistent between MachO/ELF, but if there's a compelling performance reason then I think it's fine.

From the perspective of Fuchsia, we've seen a noticeable impact of this change when using -fprofile-instr-generate together -fprofile-list to apply instrumentation selectively only to modified files.

What kind of impact do you see? If #counters > 0, is it mostly binary size cost? If #counters == 0, what's the avg. overhead of writing out the full profile?

It depends a bit on the runtime and the platform. In Fuchsia where we always use the continuous mode, there's quite a bit of upfront cost (see https://github.com/llvm/llvm-project/blob/75f3f778052cdcd89e79f7a42a50915ee5ce2281/compiler-rt/lib/profile/InstrProfilingPlatformFuchsia.c#L109), we need to allocate a memory object, map it into address space, publish it, write some additional information into the log.

While it may not be so bad for a single binary, over hundreds of tests it adds up and with this change we saw the total test execution time go down from 30 to 18 minutes. There are other steps we're taking, like eliminating the need for logging, but that's unlikely to eliminate all of the overhead.

Can it be fixed by doing an early-exit in the runtime initializer, writing out an empty .profraw?

I considered that initially, but that's less efficient than the approach implemented here, especially when it comes to binary size.

That raises a question about tooling support: some workflows (like the Xcode coverage plugin) might assume that a program compiled with -fprofile-instr-generate always creates a .profraw. If there's no profile written at all for the #counters == 0 case, that could be a breaking change.

That's a good point, would it be better to put this behind a (frontend or backend) flag?

vsk added a comment.Mar 9 2021, 4:37 PM
In D98061#2615334, @vsk wrote:
In D98061#2615239, @vsk wrote:

@ributzka may have stronger thoughts about when -fprofile-instr-generate must imply that a known set of symbols appear with external visibility. Up until now, the answer has been "always", and this is what tapi enforces for MachO. It's awkward to have this be inconsistent between MachO/ELF, but if there's a compelling performance reason then I think it's fine.

From the perspective of Fuchsia, we've seen a noticeable impact of this change when using -fprofile-instr-generate together -fprofile-list to apply instrumentation selectively only to modified files.

What kind of impact do you see? If #counters > 0, is it mostly binary size cost? If #counters == 0, what's the avg. overhead of writing out the full profile?

It depends a bit on the runtime and the platform. In Fuchsia where we always use the continuous mode, there's quite a bit of upfront cost (see https://github.com/llvm/llvm-project/blob/75f3f778052cdcd89e79f7a42a50915ee5ce2281/compiler-rt/lib/profile/InstrProfilingPlatformFuchsia.c#L109), we need to allocate a memory object, map it into address space, publish it, write some additional information into the log.

While it may not be so bad for a single binary, over hundreds of tests it adds up and with this change we saw the total test execution time go down from 30 to 18 minutes. There are other steps we're taking, like eliminating the need for logging, but that's unlikely to eliminate all of the overhead.

Can it be fixed by doing an early-exit in the runtime initializer, writing out an empty .profraw?

I considered that initially, but that's less efficient than the approach implemented here, especially when it comes to binary size.

I don't follow where the binary size cost comes from (is it the cost of writing out many empty .profraw headers?), but it sounds like the 30 -> 18min speed up is achieved by not registering essentially an empty profile with the VM, so it does follow that unconditionally writing out an empty .profraw won't work as well as disabling the runtime initializer entirely.

That raises a question about tooling support: some workflows (like the Xcode coverage plugin) might assume that a program compiled with -fprofile-instr-generate always creates a .profraw. If there's no profile written at all for the #counters == 0 case, that could be a breaking change.

That's a good point, would it be better to put this behind a (frontend or backend) flag?

I don't think it should be an option because I have doubts about how discoverable it'd be. My preference would be to add a section to the clang coverage docs explaining the different guarantees for .profraw emission.

Is the case when there is no counters very rare? And for those cases, how much overhead the runtime hook can incur? I assume it is small compared with actual instrumentation?

Is the case when there is no counters very rare? And for those cases, how much overhead the runtime hook can incur? I assume it is small compared with actual instrumentation?

I'll try to explain our use case in more detail, hopefully that'll make the issue more clear.

We're collecting coverage during pre-submit testing and we're trying to reduce the overhead of instrumentation to provide fast turnaround time, so we're only instrumenting modified code. So if a patch modifies the header foo.h, we'd generate profile.list with the following content:

src:include/foo.h

and set the global flag -fprofile-list=profile.list (because we don't know what are all the places where foo.h is included). For binaries that contain TUs that include foo.h, those TUs get instrumented and the profile runtime is linked in into the binary as expected.

There are also going to be binaries in our system build that don't use foo.h and ideally shouldn't have any overhead since nothing inside them is instrumented (they should behave as if -fprofile-instr-generate wasn't set for them at all). That's not the case today because if you set -fprofile-instr-generate, driver passes -u__llvm_profile_runtime to the linker which "pulls in" the profile runtime introducing some extra bloat and startup overhead I described earlier.

This change is trying to address the issue by declaring __llvm_profile_runtime only when it's needed (that is, only when TU contains some instrumented functions) rather than doing it unconditionally in the driver where we don't yet know if the runtime will be needed or not.

thanks for the background. This patch looks good at higher level. Vedant can help detailed review.

vsk accepted this revision.Mar 10 2021, 12:11 PM

Thanks for the detailed explanation of the -fprofile-list workflow; given the difference constraints, this patch lgtm. Please document the divergent behavior re: no .profraw file when #counters == 0 for non-MachO in the clang docs.

This revision is now accepted and ready to land.Mar 10 2021, 12:11 PM
phosek updated this revision to Diff 329875.Mar 11 2021, 1:26 AM
This revision was automatically updated to reflect the committed changes.
phosek reopened this revision.Mar 12 2021, 2:34 PM
This revision is now accepted and ready to land.Mar 12 2021, 2:34 PM
MaskRay added a subscriber: MaskRay.EditedMar 12 2021, 7:06 PM

I am a bit concerned that whether the file is generated or not is now dependent on the instrumentation and linker garbage collection.

There are also going to be binaries in our system build that don't use foo.h and ideally shouldn't have any overhead since nothing inside them is instrumented (they should behave as if -fprofile-instr-generate wasn't set for them at all). That's not the case today because if you set -fprofile-instr-generate, driver passes -u__llvm_profile_runtime to the linker which "pulls in" the profile runtime introducing some extra bloat and startup overhead I described earlier.

The overhead is just __llvm_profile_write_file, right? It just writes a 100+ bytes file which has very little overhead.

Some sanitizers can be used in a link-only manner without instrumentation, e.g. -fsanitize=leak does not need instrumentation. The source code just loses __has_feature(leak_sanitizer) detection.
Link-only -fsanitize=address can catch double free and mismatching new/delete.

Do we expect that libclang_rt.profile- can provide other features which may be useful even if there is nothing to instrument according to -fprofile-list?
If yes, making the library conditionally not linked can lose such features.

Another case is ld.lld --thinlto-index-only always creates *.imports and *.thinlto.bc files, to convey to the build system that the files are correctly generated.

phosek updated this revision to Diff 332739.Mar 23 2021, 11:23 AM

I am a bit concerned that whether the file is generated or not is now dependent on the instrumentation and linker garbage collection.

That's a fair concern, do you know about use cases where this would cause issues?

There are also going to be binaries in our system build that don't use foo.h and ideally shouldn't have any overhead since nothing inside them is instrumented (they should behave as if -fprofile-instr-generate wasn't set for them at all). That's not the case today because if you set -fprofile-instr-generate, driver passes -u__llvm_profile_runtime to the linker which "pulls in" the profile runtime introducing some extra bloat and startup overhead I described earlier.

The overhead is just __llvm_profile_write_file, right? It just writes a 100+ bytes file which has very little overhead.

It could be more if you use the continuous mode where you'd also need to mmap the profile and do some additional setup.

Some sanitizers can be used in a link-only manner without instrumentation, e.g. -fsanitize=leak does not need instrumentation. The source code just loses __has_feature(leak_sanitizer) detection.
Link-only -fsanitize=address can catch double free and mismatching new/delete.

Do we expect that libclang_rt.profile- can provide other features which may be useful even if there is nothing to instrument according to -fprofile-list?

I'm not aware of any such features being planned right now.

If yes, making the library conditionally not linked can lose such features.

Another case is ld.lld --thinlto-index-only always creates *.imports and *.thinlto.bc files, to convey to the build system that the files are correctly generated.

Since each instrumented binary (that is executable and shared library) generates its own profile, how many profiles get generated really depends on how many instrumented shared libraries your binary depends on. Furthermore, if you use profile merging, you cannot even predict what the profile names are going to be, so I assumed that build systems/test runners would need some other mechanism to find out what profiles were generated, at least that's the case for us.

This approach was already used prior to 9a041a75221ca, but we changed it to always generate the llvm_profile_runtime due to a TAPI limitation.

D43794 (rG9a041a75221ca) does not seem to affect the ELF behavior.

We can stop passing -u__llvm_profile_runtime to the linker on Linux and Fuchsia since the generated undefined symbol in each translation unit that needed it serves the same purpose.

This restores the behavior before @davidxl's rG170cd100ed6f38ec5826dbd1bd6930ddfd3490a4 (2015).
While having less code in the clang driver side is pros to me, I don't know whether it should come with the cost of conditional presence/absence of the .profraw file.
(you can specify LLVM_PROFILE_FILE without %m; -fprofile-profile-generate by default does not use %m)

There are also going to be binaries in our system build that don't use foo.h and ideally shouldn't have any overhead since nothing inside them is instrumented (they should behave as if -fprofile-instr-generate wasn't set for them at all). That's not the case today because if you set -fprofile-instr-generate, driver passes -u__llvm_profile_runtime to the linker which "pulls in" the profile runtime introducing some extra bloat and startup overhead I described earlier.

The overhead is just __llvm_profile_write_file, right? It just writes a 100+ bytes file which has very little overhead.

It could be more if you use the continuous mode where you'd also need to mmap the profile and do some additional setup.

Can the cost be avoided in the runtime?

llvm/test/Instrumentation/InstrProfiling/linkage.ll
13

If the symbol is now present, a CHECK line should be kept.