This is an archive of the discontinued LLVM Phabricator instance.

[InstrProfiling] Generate runtime hook for Fuchsia
ClosedPublic

Authored by phosek on Mar 5 2021, 10:49 AM.

Details

Summary

When none of the translation units in the binary have been instrumented
we shouldn't need to link the profile runtime. However, because we pass
-u__llvm_profile_runtime on Linux and Fuchsia, the runtime would still
be pulled in and incur some overhead. On Fuchsia which uses runtime
counter relocation, it also means that we cannot reference the bias
variable unconditionally.

This change modifies the InstrProfiling pass to pull in the profile
runtime only when needed by declaring the llvm_profile_runtime symbol
in the translation unit only when needed. For now we restrict this only
for Fuchsia, but this can be later expanded to other platforms. This
approach was already used prior to 9a041a75221ca, but we changed it
to always generate the
llvm_profile_runtime due to a TAPI limitation,
but that limitation may no longer apply, and it certainly doesn't apply
on platforms like Fuchsia.

Diff Detail

Event Timeline

phosek created this revision.Mar 5 2021, 10:49 AM
phosek requested review of this revision.Mar 5 2021, 10:49 AM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptMar 5 2021, 10:49 AM
phosek added inline comments.Mar 5 2021, 10:50 AM
llvm/lib/Transforms/Instrumentation/InstrProfiling.cpp
1138–1139

@vsk do you know why we need this function instead of just using llvm.compiler.used/llvm.used for the symbol? I used that approach for ELF and it seems to be working fine.

phosek added a comment.Mar 9 2021, 2:46 PM

@vsk do you have any thoughts on this?

vsk added a subscriber: ributzka.Mar 9 2021, 2:57 PM

@ributzka may have stronger thoughts about when -fprofile-instr-generate must imply that a known set of symbols appear with external visibility. Up until now, the answer has been "always", and this is what tapi enforces for MachO. It's awkward to have this be inconsistent between MachO/ELF, but if there's a compelling performance reason then I think it's fine.

llvm/lib/Transforms/Instrumentation/InstrProfiling.cpp
1138–1139

I don't have the context for this, since this code is from before I started working on llvm. I'm guessing, but maybe it's possible that llvm(.compiler)?.used didn't exist or work well when this code was written.

phosek marked an inline comment as done.Mar 9 2021, 3:07 PM
In D98061#2615239, @vsk wrote:

@ributzka may have stronger thoughts about when -fprofile-instr-generate must imply that a known set of symbols appear with external visibility. Up until now, the answer has been "always", and this is what tapi enforces for MachO. It's awkward to have this be inconsistent between MachO/ELF, but if there's a compelling performance reason then I think it's fine.

From the perspective of Fuchsia, we've seen a noticeable impact of this change when using -fprofile-instr-generate together -fprofile-list to apply instrumentation selectively only to modified files.

llvm/lib/Transforms/Instrumentation/InstrProfiling.cpp
1138–1139

Would it be OK with you if I sent out a separate change to remove this?

vsk added a comment.Mar 9 2021, 3:28 PM
In D98061#2615239, @vsk wrote:

@ributzka may have stronger thoughts about when -fprofile-instr-generate must imply that a known set of symbols appear with external visibility. Up until now, the answer has been "always", and this is what tapi enforces for MachO. It's awkward to have this be inconsistent between MachO/ELF, but if there's a compelling performance reason then I think it's fine.

From the perspective of Fuchsia, we've seen a noticeable impact of this change when using -fprofile-instr-generate together -fprofile-list to apply instrumentation selectively only to modified files.

What kind of impact do you see? If #counters > 0, is it mostly binary size cost? If #counters == 0, what's the avg. overhead of writing out the full profile? Can it be fixed by doing an early-exit in the runtime initializer, writing out an empty .profraw?

That raises a question about tooling support: some workflows (like the Xcode coverage plugin) might assume that a program compiled with -fprofile-instr-generate always creates a .profraw. If there's no profile written at all for the #counters == 0 case, that could be a breaking change.

llvm/lib/Transforms/Instrumentation/InstrProfiling.cpp
1138–1139

Thanks, yes, that would be great.

phosek marked an inline comment as done.Mar 9 2021, 3:56 PM
In D98061#2615334, @vsk wrote:
In D98061#2615239, @vsk wrote:

@ributzka may have stronger thoughts about when -fprofile-instr-generate must imply that a known set of symbols appear with external visibility. Up until now, the answer has been "always", and this is what tapi enforces for MachO. It's awkward to have this be inconsistent between MachO/ELF, but if there's a compelling performance reason then I think it's fine.

From the perspective of Fuchsia, we've seen a noticeable impact of this change when using -fprofile-instr-generate together -fprofile-list to apply instrumentation selectively only to modified files.

What kind of impact do you see? If #counters > 0, is it mostly binary size cost? If #counters == 0, what's the avg. overhead of writing out the full profile?

It depends a bit on the runtime and the platform. In Fuchsia where we always use the continuous mode, there's quite a bit of upfront cost (see https://github.com/llvm/llvm-project/blob/75f3f778052cdcd89e79f7a42a50915ee5ce2281/compiler-rt/lib/profile/InstrProfilingPlatformFuchsia.c#L109), we need to allocate a memory object, map it into address space, publish it, write some additional information into the log.

While it may not be so bad for a single binary, over hundreds of tests it adds up and with this change we saw the total test execution time go down from 30 to 18 minutes. There are other steps we're taking, like eliminating the need for logging, but that's unlikely to eliminate all of the overhead.

Can it be fixed by doing an early-exit in the runtime initializer, writing out an empty .profraw?

I considered that initially, but that's less efficient than the approach implemented here, especially when it comes to binary size.

That raises a question about tooling support: some workflows (like the Xcode coverage plugin) might assume that a program compiled with -fprofile-instr-generate always creates a .profraw. If there's no profile written at all for the #counters == 0 case, that could be a breaking change.

That's a good point, would it be better to put this behind a (frontend or backend) flag?

vsk added a comment.Mar 9 2021, 4:37 PM
In D98061#2615334, @vsk wrote:
In D98061#2615239, @vsk wrote:

@ributzka may have stronger thoughts about when -fprofile-instr-generate must imply that a known set of symbols appear with external visibility. Up until now, the answer has been "always", and this is what tapi enforces for MachO. It's awkward to have this be inconsistent between MachO/ELF, but if there's a compelling performance reason then I think it's fine.

From the perspective of Fuchsia, we've seen a noticeable impact of this change when using -fprofile-instr-generate together -fprofile-list to apply instrumentation selectively only to modified files.

What kind of impact do you see? If #counters > 0, is it mostly binary size cost? If #counters == 0, what's the avg. overhead of writing out the full profile?

It depends a bit on the runtime and the platform. In Fuchsia where we always use the continuous mode, there's quite a bit of upfront cost (see https://github.com/llvm/llvm-project/blob/75f3f778052cdcd89e79f7a42a50915ee5ce2281/compiler-rt/lib/profile/InstrProfilingPlatformFuchsia.c#L109), we need to allocate a memory object, map it into address space, publish it, write some additional information into the log.

While it may not be so bad for a single binary, over hundreds of tests it adds up and with this change we saw the total test execution time go down from 30 to 18 minutes. There are other steps we're taking, like eliminating the need for logging, but that's unlikely to eliminate all of the overhead.

Can it be fixed by doing an early-exit in the runtime initializer, writing out an empty .profraw?

I considered that initially, but that's less efficient than the approach implemented here, especially when it comes to binary size.

I don't follow where the binary size cost comes from (is it the cost of writing out many empty .profraw headers?), but it sounds like the 30 -> 18min speed up is achieved by not registering essentially an empty profile with the VM, so it does follow that unconditionally writing out an empty .profraw won't work as well as disabling the runtime initializer entirely.

That raises a question about tooling support: some workflows (like the Xcode coverage plugin) might assume that a program compiled with -fprofile-instr-generate always creates a .profraw. If there's no profile written at all for the #counters == 0 case, that could be a breaking change.

That's a good point, would it be better to put this behind a (frontend or backend) flag?

I don't think it should be an option because I have doubts about how discoverable it'd be. My preference would be to add a section to the clang coverage docs explaining the different guarantees for .profraw emission.

Is the case when there is no counters very rare? And for those cases, how much overhead the runtime hook can incur? I assume it is small compared with actual instrumentation?

Is the case when there is no counters very rare? And for those cases, how much overhead the runtime hook can incur? I assume it is small compared with actual instrumentation?

I'll try to explain our use case in more detail, hopefully that'll make the issue more clear.

We're collecting coverage during pre-submit testing and we're trying to reduce the overhead of instrumentation to provide fast turnaround time, so we're only instrumenting modified code. So if a patch modifies the header foo.h, we'd generate profile.list with the following content:

src:include/foo.h

and set the global flag -fprofile-list=profile.list (because we don't know what are all the places where foo.h is included). For binaries that contain TUs that include foo.h, those TUs get instrumented and the profile runtime is linked in into the binary as expected.

There are also going to be binaries in our system build that don't use foo.h and ideally shouldn't have any overhead since nothing inside them is instrumented (they should behave as if -fprofile-instr-generate wasn't set for them at all). That's not the case today because if you set -fprofile-instr-generate, driver passes -u__llvm_profile_runtime to the linker which "pulls in" the profile runtime introducing some extra bloat and startup overhead I described earlier.

This change is trying to address the issue by declaring __llvm_profile_runtime only when it's needed (that is, only when TU contains some instrumented functions) rather than doing it unconditionally in the driver where we don't yet know if the runtime will be needed or not.

thanks for the background. This patch looks good at higher level. Vedant can help detailed review.

vsk accepted this revision.Mar 10 2021, 12:11 PM

Thanks for the detailed explanation of the -fprofile-list workflow; given the difference constraints, this patch lgtm. Please document the divergent behavior re: no .profraw file when #counters == 0 for non-MachO in the clang docs.

This revision is now accepted and ready to land.Mar 10 2021, 12:11 PM
phosek updated this revision to Diff 329875.Mar 11 2021, 1:26 AM
This revision was automatically updated to reflect the committed changes.
phosek reopened this revision.Mar 12 2021, 2:34 PM
This revision is now accepted and ready to land.Mar 12 2021, 2:34 PM
MaskRay added a subscriber: MaskRay.EditedMar 12 2021, 7:06 PM

I am a bit concerned that whether the file is generated or not is now dependent on the instrumentation and linker garbage collection.

There are also going to be binaries in our system build that don't use foo.h and ideally shouldn't have any overhead since nothing inside them is instrumented (they should behave as if -fprofile-instr-generate wasn't set for them at all). That's not the case today because if you set -fprofile-instr-generate, driver passes -u__llvm_profile_runtime to the linker which "pulls in" the profile runtime introducing some extra bloat and startup overhead I described earlier.

The overhead is just __llvm_profile_write_file, right? It just writes a 100+ bytes file which has very little overhead.

Some sanitizers can be used in a link-only manner without instrumentation, e.g. -fsanitize=leak does not need instrumentation. The source code just loses __has_feature(leak_sanitizer) detection.
Link-only -fsanitize=address can catch double free and mismatching new/delete.

Do we expect that libclang_rt.profile- can provide other features which may be useful even if there is nothing to instrument according to -fprofile-list?
If yes, making the library conditionally not linked can lose such features.

Another case is ld.lld --thinlto-index-only always creates *.imports and *.thinlto.bc files, to convey to the build system that the files are correctly generated.

phosek updated this revision to Diff 332739.Mar 23 2021, 11:23 AM

I am a bit concerned that whether the file is generated or not is now dependent on the instrumentation and linker garbage collection.

That's a fair concern, do you know about use cases where this would cause issues?

There are also going to be binaries in our system build that don't use foo.h and ideally shouldn't have any overhead since nothing inside them is instrumented (they should behave as if -fprofile-instr-generate wasn't set for them at all). That's not the case today because if you set -fprofile-instr-generate, driver passes -u__llvm_profile_runtime to the linker which "pulls in" the profile runtime introducing some extra bloat and startup overhead I described earlier.

The overhead is just __llvm_profile_write_file, right? It just writes a 100+ bytes file which has very little overhead.

It could be more if you use the continuous mode where you'd also need to mmap the profile and do some additional setup.

Some sanitizers can be used in a link-only manner without instrumentation, e.g. -fsanitize=leak does not need instrumentation. The source code just loses __has_feature(leak_sanitizer) detection.
Link-only -fsanitize=address can catch double free and mismatching new/delete.

Do we expect that libclang_rt.profile- can provide other features which may be useful even if there is nothing to instrument according to -fprofile-list?

I'm not aware of any such features being planned right now.

If yes, making the library conditionally not linked can lose such features.

Another case is ld.lld --thinlto-index-only always creates *.imports and *.thinlto.bc files, to convey to the build system that the files are correctly generated.

Since each instrumented binary (that is executable and shared library) generates its own profile, how many profiles get generated really depends on how many instrumented shared libraries your binary depends on. Furthermore, if you use profile merging, you cannot even predict what the profile names are going to be, so I assumed that build systems/test runners would need some other mechanism to find out what profiles were generated, at least that's the case for us.

This approach was already used prior to 9a041a75221ca, but we changed it to always generate the llvm_profile_runtime due to a TAPI limitation.

D43794 (rG9a041a75221ca) does not seem to affect the ELF behavior.

We can stop passing -u__llvm_profile_runtime to the linker on Linux and Fuchsia since the generated undefined symbol in each translation unit that needed it serves the same purpose.

This restores the behavior before @davidxl's rG170cd100ed6f38ec5826dbd1bd6930ddfd3490a4 (2015).
While having less code in the clang driver side is pros to me, I don't know whether it should come with the cost of conditional presence/absence of the .profraw file.
(you can specify LLVM_PROFILE_FILE without %m; -fprofile-profile-generate by default does not use %m)

There are also going to be binaries in our system build that don't use foo.h and ideally shouldn't have any overhead since nothing inside them is instrumented (they should behave as if -fprofile-instr-generate wasn't set for them at all). That's not the case today because if you set -fprofile-instr-generate, driver passes -u__llvm_profile_runtime to the linker which "pulls in" the profile runtime introducing some extra bloat and startup overhead I described earlier.

The overhead is just __llvm_profile_write_file, right? It just writes a 100+ bytes file which has very little overhead.

It could be more if you use the continuous mode where you'd also need to mmap the profile and do some additional setup.

Can the cost be avoided in the runtime?

llvm/test/Instrumentation/InstrProfiling/linkage.ll
13 ↗(On Diff #332739)

If the symbol is now present, a CHECK line should be kept.

phosek updated this revision to Diff 365635.Aug 10 2021, 5:50 PM
phosek retitled this revision from [InstrProfiling] Generate runtime hook for ELF platforms to [InstrProfiling] Generate runtime hook for Fuchsia.
phosek edited the summary of this revision. (Show Details)

I have reworked the change and restricted it only to Fuchsia for now which is where this change is still desirable.

This revision was landed with ongoing or failed builds.Aug 10 2021, 11:21 PM
This revision was automatically updated to reflect the committed changes.
kile added a subscriber: kile.Aug 16 2021, 11:27 AM

Hi Petr,

This looks to be the change that most likely broke a test on Windows Debug - would you mind taking a look? Here's the relevant test and stack trace:

FAIL: LLVM :: Instrumentation/InstrProfiling/linkage.ll (56524 of 77140)

  • TEST 'LLVM :: Instrumentation/InstrProfiling/linkage.ll' FAILED ****

Script:

: 'RUN: at line 3'; d:\a\_work\1\b\llvm\debug\bin\opt.exe < D:\a\_work\1\s\llvm-project\llvm\test\Instrumentation\InstrProfiling\linkage.ll -mtriple=x86_64-apple-macosx10.10.0 -instrprof -S | d:\a\_work\1\b\llvm\debug\bin\filecheck.exe D:\a\_work\1\s\llvm-project\llvm\test\Instrumentation\InstrProfiling\linkage.ll --check-prefixes=MACHO
: 'RUN: at line 4'; d:\a\_work\1\b\llvm\debug\bin\opt.exe < D:\a\_work\1\s\llvm-project\llvm\test\Instrumentation\InstrProfiling\linkage.ll -mtriple=x86_64-unknown-linux -instrprof -S | d:\a\_work\1\b\llvm\debug\bin\filecheck.exe D:\a\_work\1\s\llvm-project\llvm\test\Instrumentation\InstrProfiling\linkage.ll --check-prefixes=ELF
: 'RUN: at line 5'; d:\a\_work\1\b\llvm\debug\bin\opt.exe < D:\a\_work\1\s\llvm-project\llvm\test\Instrumentation\InstrProfiling\linkage.ll -mtriple=x86_64-unknown-fuchsia -instrprof -S | d:\a\_work\1\b\llvm\debug\bin\filecheck.exe D:\a\_work\1\s\llvm-project\llvm\test\Instrumentation\InstrProfiling\linkage.ll --check-prefixes=ELF
: 'RUN: at line 6'; d:\a\_work\1\b\llvm\debug\bin\opt.exe < D:\a\_work\1\s\llvm-project\llvm\test\Instrumentation\InstrProfiling\linkage.ll -mtriple=x86_64-pc-win32-coff -instrprof -S | d:\a\_work\1\b\llvm\debug\bin\filecheck.exe D:\a\_work\1\s\llvm-project\llvm\test\Instrumentation\InstrProfiling\linkage.ll --check-prefixes=COFF
: 'RUN: at line 7'; d:\a\_work\1\b\llvm\debug\bin\opt.exe < D:\a\_work\1\s\llvm-project\llvm\test\Instrumentation\InstrProfiling\linkage.ll -mtriple=x86_64-apple-macosx10.10.0 -passes=instrprof -S | d:\a\_work\1\b\llvm\debug\bin\filecheck.exe D:\a\_work\1\s\llvm-project\llvm\test\Instrumentation\InstrProfiling\linkage.ll --check-prefixes=MACHO
: 'RUN: at line 8'; d:\a\_work\1\b\llvm\debug\bin\opt.exe < D:\a\_work\1\s\llvm-project\llvm\test\Instrumentation\InstrProfiling\linkage.ll -mtriple=x86_64-unknown-linux -passes=instrprof -S | d:\a\_work\1\b\llvm\debug\bin\filecheck.exe D:\a\_work\1\s\llvm-project\llvm\test\Instrumentation\InstrProfiling\linkage.ll --check-prefixes=ELF
: 'RUN: at line 9'; d:\a\_work\1\b\llvm\debug\bin\opt.exe < D:\a\_work\1\s\llvm-project\llvm\test\Instrumentation\InstrProfiling\linkage.ll -mtriple=x86_64-unknown-fuchsia -passes=instrprof -S | d:\a\_work\1\b\llvm\debug\bin\filecheck.exe D:\a\_work\1\s\llvm-project\llvm\test\Instrumentation\InstrProfiling\linkage.ll --check-prefixes=ELF
: 'RUN: at line 10'; d:\a\_work\1\b\llvm\debug\bin\opt.exe < D:\a\_work\1\s\llvm-project\llvm\test\Instrumentation\InstrProfiling\linkage.ll -mtriple=x86_64-pc-win32-coff -passes=instrprof -S | d:\a\_work\1\b\llvm\debug\bin\filecheck.exe D:\a\_work\1\s\llvm-project\llvm\test\Instrumentation\InstrProfiling\linkage.ll --check-prefixes=COFF

Exit Code: 2

Command Output (stdout):

$ ":" "RUN: at line 3"
$ "d:\a\_work\1\b\llvm\debug\bin\opt.exe" "-mtriple=x86_64-apple-macosx10.10.0" "-instrprof" "-S"
$ "d:\a\_work\1\b\llvm\debug\bin\filecheck.exe" "D:\a\_work\1\s\llvm-project\llvm\test\Instrumentation\InstrProfiling\linkage.ll" "--check-prefixes=MACHO"
$ ":" "RUN: at line 4"
$ "d:\a\_work\1\b\llvm\debug\bin\opt.exe" "-mtriple=x86_64-unknown-linux" "-instrprof" "-S"
$ "d:\a\_work\1\b\llvm\debug\bin\filecheck.exe" "D:\a\_work\1\s\llvm-project\llvm\test\Instrumentation\InstrProfiling\linkage.ll" "--check-prefixes=ELF"
$ ":" "RUN: at line 5"
$ "d:\a\_work\1\b\llvm\debug\bin\opt.exe" "-mtriple=x86_64-unknown-fuchsia" "-instrprof" "-S"

command stderr:

PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace.

Stack dump:

0. Program arguments: d:\\a\\_work\\1\\b\\llvm\\debug\\bin\\opt.exe -mtriple=x86_64-unknown-fuchsia -instrprof -S

#0 0x00007ff6c05a6844 failwithmessage d:\agent\_work\4\s\src\vctools\crt\vcstartup\src\rtc\error.cpp:213:0

#1 0x00007ff6c05a69e4 _RTC_UninitUse d:\agent\_work\4\s\src\vctools\crt\vcstartup\src\rtc\error.cpp:362:0

#2 0x00007ff6be864982 llvm::InstrProfiling::run(class llvm::Module &, class std::function<(class llvm::Function &)>) D:\a\_work\1\s\llvm-project\llvm\lib\Transforms\Instrumentation\InstrProfiling.cpp:590:0

#3 0x00007ff6be8644ad llvm::InstrProfiling::run(class llvm::Module &, class llvm::AnalysisManager<class llvm::Module> &) D:\a\_work\1\s\llvm-project\llvm\lib\Transforms\Instrumentation\InstrProfiling.cpp:417:0

#4 0x00007ff6bf82d229 llvm::detail::PassModel<class llvm::Module, class llvm::InstrProfiling, class llvm::PreservedAnalyses, class llvm::AnalysisManager<class llvm::Module>>::run(class llvm::Module &, class llvm::AnalysisManager<class llvm::Module> &) D:\a\_work\1\s\llvm-project\llvm\include\llvm\IR\PassManagerInternal.h:85:0

#5 0x00007ff6bc402b9e llvm::PassManager<class llvm::Module, class llvm::AnalysisManager<class llvm::Module>>::run(class llvm::Module &, class llvm::AnalysisManager<class llvm::Module> &) D:\a\_work\1\s\llvm-project\llvm\include\llvm\IR\PassManager.h:509:0

#6 0x00007ff6bc3bd844 llvm::runPassPipeline(class llvm::StringRef, class llvm::Module &, class llvm::TargetMachine *, class llvm::TargetLibraryInfoImpl *, class llvm::ToolOutputFile *, class llvm::ToolOutputFile *, class llvm::ToolOutputFile *, class llvm::StringRef, class llvm::ArrayRef<class llvm::StringRef>, enum llvm::opt_tool::OutputKind, enum llvm::opt_tool::VerifierKind, bool, bool, bool, bool, bool) D:\a\_work\1\s\llvm-project\llvm\tools\opt\NewPMDriver.cpp:456:0

#7 0x00007ff6bc417060 main D:\a\_work\1\s\llvm-project\llvm\tools\opt\opt.cpp:830:0

#8 0x00007ff6c05a6039 invoke_main d:\agent\_work\4\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl:79:0

kile added a comment.Aug 16 2021, 3:17 PM

Apologies, reporting on the incorrect revision :(