
Machine IR Profile
Needs ReviewPublic

Authored by ellis on Thu, Jun 10, 2:15 PM.

Details

Reviewers
alexshap
Summary

Machine IR Profile (MIP)

tl;dr

This is a proposal to introduce a new instrumentation pass that can produce optimization profiles with a focus on binary size and runtime performance of the instrumented binaries.

Our instrumented binaries record machine function call counts, machine function timestamps, machine basic block coverage, and a subset of the dynamic call graph. There is also a more lightweight mode that only collects machine function coverage data that has negligible runtime overhead and a binary size increase of 2-5% for instrumented binaries.

This is just the first patch of the WIP MIP project. The full branch can be found at https://github.com/ellishg/llvm-project

Motivation

In the mobile space, increasing binary size has an outsized impact on both runtime performance and download speed. Current instrumentation implementations such as XRay and GCov produce binaries that are too slow and too large to run on real mobile devices. We propose a new pass that injects instrumentation code at the machine IR level. At runtime, we write profile data to our custom __llvm_mipraw section, which is eventually dumped to a .mipraw file. At build time, we emit a .mipmap file which we use to map function information to data in the .mipraw file. The result is that no redundant function info is stored in the binary, which allows our instrumentation to have minimal size overhead.
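As a rough sketch of this split (hypothetical Python, not the actual MIP implementation; the names and values are invented), the shipped binary holds only compact per-function counters, while the name-to-offset metadata lives off-device in the .mipmap:

```python
# Off-device metadata extracted at build time: function name -> offset of
# that function's data in the raw section.
mipmap = {"_Z3fooi": 0, "_Z3bari": 1}

# On-device raw section dumped at runtime: one coverage byte per function,
# initialized to 0xFF and overwritten with 0x00 when the function runs.
mipraw = bytes([0xFF, 0x00])

# Post-processing joins the two files into a readable profile; no function
# names ever need to be stored in the instrumented binary itself.
profile = {name: mipraw[off] == 0x00 for name, off in mipmap.items()}
# _Z3fooi was never called, _Z3bari was.
```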

MIP has been implemented on ELF and Mach-O targets for x86_64, AArch64, and Armv7 with Thumb and Thumb2.

Performance

Our focus for now is on the performance and size of binaries that have injected instrumentation instead of binaries that have been optimized with instrumentation profiles. We collected some basic results from MultiSource/Benchmarks in llvm-test-suite for both MIP and clang’s instrumentation using the -fprofile-generate flag. It should be noted that this comparison is not fair because clang’s instrumentation collects much more data than just function coverage. However, we expect fully-featured MIP to have similar metrics.

Instrumented Binary Size

At the moment, we have implemented function coverage which injects one x86_64 instruction (7 bytes) and one byte of global data for each instrumented function, which should have minimal impact on binary size and runtime performance. In fact, our results show that we should expect MIP instrumented binaries to be only 2-5% larger. We contrast this with clang’s instrumentation, which can increase the binary size by 500-900%.

Instrumented Execution Time

We found that MIP instrumentation caused negligible execution time regressions. Again, we can (unfairly) contrast this with -fprofile-generate, which increased execution time by 1-40%.

Usage

We use the -fmachine-profile-generate clang flag to produce an instrumented binary and then use llvm-objcopy to extract the .mipmap file.

$ clang -g -fmachine-profile-generate main.cpp
$ llvm-objcopy --dump-section=__llvm_mipmap=default.mipmap a.out /dev/null
$ llvm-strip -g a.out -o a.out.stripped

This will produce the instrumented binary a.out and a map file default.mipmap.

When we run the binary, it will produce a default.mipraw file containing the profile data for that run.

$ ./a.out.stripped
$ ls
a.out    default.mipmap    default.mipraw    main.cpp

Then we use our custom tool to postprocess the raw profile and produce the final profile default.mip.

$ llvm-mipdata create -p default.mip default.mipmap
$ llvm-mipdata merge -p default.mip default.mipraw

If our binary has debug info, we can use it to report source information along with the profile data.

$ llvm-mipdata show -p default.mip --debug a.out
_Z3fooi
  Source Info: /home/main.cpp:9
  Call Count: 0
  Block Coverage:
     COLD COLD COLD COLD COLD

_Z3bari
  Source Info: /home/main.cpp:16
  Call Count: 1
  Block Coverage:
     HOT  HOT  COLD HOT  HOT

Finally, we can consume the profile using the clang flag -fmachine-profile-use= to produce a profile-optimized binary.

$ clang -fmachine-profile-use=default.mip main.cpp

Diff Detail

Event Timeline

ellis requested review of this revision.Thu, Jun 10, 2:15 PM
ellis created this object with visibility "ellis (Ellis Hoag)".
ellis created this object with edit policy "ellis (Ellis Hoag)".
Herald added a project: Restricted Project. Thu, Jun 10, 2:15 PM
ellis edited the summary of this revision. (Show Details)Thu, Jun 10, 2:17 PM
ellis edited the summary of this revision. (Show Details)
ellis edited the summary of this revision. (Show Details)Thu, Jun 10, 3:03 PM
ellis changed the visibility from "ellis (Ellis Hoag)" to "Public (No Login Required)".
ellis changed the edit policy from "ellis (Ellis Hoag)" to "All Users".
ellis added a comment.EditedThu, Jun 10, 3:11 PM

bcain added a subscriber: bcain.Thu, Jun 10, 3:28 PM
kyulee added a subscriber: kyulee.Thu, Jun 10, 3:30 PM
ellis added a subscriber: dberris.Thu, Jun 10, 3:38 PM

Very interesting work. A high level question, the CSFDO/CSPGO instruments the program after the inlining pass, so it can provide the same level of coverage for machine IR. What additional information do we expect from MIR profile that is not available in CSFDO?

ellis added a comment.Thu, Jun 10, 4:13 PM

Very interesting work. A high level question, the CSFDO/CSPGO instruments the program after the inlining pass, so it can provide the same level of coverage for machine IR. What additional information do we expect from MIR profile that is not available in CSFDO?

One of the main features of MIP is that the instrumented binaries do not contain any excess metadata so they are very small. In fact, all the metadata is extracted out into a .mipmap file. Later diffs (https://github.com/ellishg/llvm-project) add more profile data including function call counts, function call order, and samples of the return address register which we can use to generate a dynamic call graph that includes dynamic dispatch calls.

Can you compare the instrumentation overhead with -fcs-profile-generate? (as with -fprofile-generate). You may also want to disable value profiling (which can be expensive) in -fcs-profile-generate.

In the MIP description, it mentions that the counter size is 1 byte. This is good enough for coverage testing, but not enough to track hotness.

vsk added a subscriber: vsk.Thu, Jun 10, 4:42 PM

Thanks for sharing this! Specifically re: the -fprofile-generate v. MIP comparison for code coverage, perhaps it would be more fair to compare against -finstrument-function-entry-bare. The latter only adds one call per function, whereas -fprofile-generate can instrument a function in more than one place.

ellis added a comment.Thu, Jun 10, 4:44 PM

In the MIP description, it mentions that the counter size is 1 byte. This is good enough for coverage testing, but not enough to track hotness.

The counter is 1 byte when only function coverage is enabled (-fmachine-profile-function-coverage). There is another mode (-fmachine-profile-call-graph) which has two 4 byte counters, one for call counts, and the other for call order, which can be used for hotness.
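One plausible way to picture "call count plus call order" (a hypothetical sketch invented here, not code from the patch) is a per-function pair of counters where the order slot is stamped from a global ticket on first entry:

```python
import itertools

# Hypothetical model of the two 4-byte counters: [call_count, call_order].
# The order value is assigned once, on first entry, from a global ticket;
# the call count keeps incrementing and can be used for hotness.
next_order = itertools.count(1)
counters = {}

def on_function_entry(func):
    entry = counters.setdefault(func, [0, 0])
    if entry[0] == 0:            # first call: record call order
        entry[1] = next(next_order)
    entry[0] += 1

for f in ["main", "foo", "bar", "foo"]:
    on_function_entry(f)
# counters: main=[1, 1], foo=[2, 2], bar=[1, 3]
```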

lanza added a subscriber: lanza.Thu, Jun 10, 5:38 PM
ellis edited the summary of this revision. (Show Details)Thu, Jun 10, 6:25 PM
ellis updated this revision to Diff 351323.Thu, Jun 10, 6:27 PM
ellis edited the summary of this revision. (Show Details)

Fix small alignment bug in codegen

ellis added a comment.Thu, Jun 10, 7:05 PM

Can you compare the instrumentation overhead with -fcs-profile-generate? (as with -fprofile-generate). You may also want to disable value profiling (which can be expensive) in -fcs-profile-generate.

Thanks for sharing this! Specifically re: the -fprofile-generate v. MIP comparison for code coverage, perhaps it would be more fair to compare against -finstrument-function-entry-bare. The latter only adds one call per function, whereas -fprofile-generate can instrument a function in more than one place.

@davidxl @vsk

Yes, I am planning to run some more simple benchmarks from llvm-test-suite for code size and performance of the instrumented binaries. I can compare MIP vs -finstrument-function-entry-bare, but are there any other flags I should compare against?

You may compare the overhead with -fsanitize-coverage=func,inline-8bit-counters,pc-table. The latter only instruments the entry block of a function and uses a one-byte counter.

ellis added a comment.Fri, Jun 11, 3:46 PM

I've collected some size and runtime metrics from llvm-test-suite. The steps and results can be found in this gist: https://gist.github.com/ellishg/92a68cf82bfdeccd10225154425edc69

Keep in mind that a major goal is that MIP instrumented binaries should be as small as possible without sacrificing usability or functionality. The metrics that I measured in the gist are the binary size and the execution time of instrumented binaries. Like I said in the description, MIP instrumented binaries are usually 2-5% larger than non-instrumented binaries. We can compare this to the -fsanitize-coverage=func,inline-bool-flag,pc-table flag which seems to produce instrumented binaries that can be 100x larger. The extra size likely comes from the function metadata that is left in the binary, whereas MIP extracts all excess metadata out of the binary.

I also looked into comparing against -finstrument-function-entry-bare and -finstrument-functions. It seems that they inject calls to void __cyg_profile_func_enter_bare(); or void __cyg_profile_func_enter (void *this_fn, void *call_site); at the start of each function. The first flag doesn't really provide enough context to know the caller and mark it as covered. The second does provide this information, but it would still require some work to collect profile data in a consumable way as MIP does. That being said, these flags seem to provide similar size and runtime overhead as MIP, albeit with much less functionality.

It should also be noted that all this is only considering MIP function coverage. MIP can profile much more data including call counts and the dynamic call graph, likely with very similar performance metrics.

Do you have runtime numbers with -fcs-profile-generate -mllvm -disable-vp=true?

This will provide more data (BB count) than what MIP can do.

Besides, more options can be added to control the type of profile data collected in IRPGO to further reduce overhead.

David

ellis added a comment.Fri, Jun 11, 4:40 PM

Do you have runtime numbers with -fcs-profile-generate -mllvm -disable-vp=true?

This will provide more data (BB count) than what MIP can do.

Besides, more options can be added to control the type of profile data collected in IRPGO to further reduce overhead.

David

I've added those numbers here https://gist.github.com/ellishg/92a68cf82bfdeccd10225154425edc69#gistcomment-3777527

The flags -fcs-profile-generate -mllvm -disable-vp=true appear to increase binary size by 4-9x, and execution time is increased in many cases.

ellis updated this revision to Diff 351601.Fri, Jun 11, 5:28 PM

Fix clang-tidy errors

ellis added a comment.Fri, Jun 11, 5:52 PM

@davidxl

Can you use the same set of benchmarks for comparison?

I'm not sure what you mean. In https://gist.github.com/ellishg/92a68cf82bfdeccd10225154425edc69 I used llvm-test-suite/MultiSource/Benchmarks/ for all tests.

  1. is MIP used in the test only for function coverage or have call counts collected as well?

Just function coverage for simplicity, but full MIP does not add very much to the size.

  2. for -fcs-profile-generate, the size increase may mostly come from the name section. Can you compare .text + .data + .bss?

Sure, I can do this next week.

ellis updated this revision to Diff 351625.Fri, Jun 11, 8:12 PM

MIP instrumentation pass should happen before branch relaxation

ellis added a comment.Sat, Jun 12, 4:55 PM

@davidxl

I've added text size data to https://gist.github.com/ellishg/92a68cf82bfdeccd10225154425edc69#gistcomment-3778109 and hopefully it is more clear. I randomly chose the test MultiSource/Benchmarks/FreeBench/fourinarow.test to show the size data. You can see the raw data in https://gist.github.com/ellishg/156639f24d728a88067f903cb53e1643

Base

"size..text": 4629
"size..data": 8,
"size..bss": 120,

MIP

"size..text": 4741,
"size..data": 8,
"size.__llvm_mipmap": 768,
"size.__llvm_mipraw": 41,
"size..bss": 120,

-fcs-profile-generate

"size..text": 24322,
"size..data": 176,
"size.__llvm_prf_cnts": 1336,
"size.__llvm_prf_data": 816,
"size.__llvm_prf_names": 117,
"size.__llvm_prf_vnds": 24576,
"size..bss": 8952,
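A quick sanity check on these numbers (Python arithmetic over the section sizes quoted above; __llvm_mipmap is excluded from the MIP total since it is extracted before deployment, and __llvm_prf_vnds is excluded from the -fcs total since it is only emitted due to a compiler bug):

```python
# Section sizes (bytes) from the fourinarow test quoted above.
base = 4629 + 8 + 120                       # .text + .data + .bss
mip = 4741 + 8 + 41 + 120                   # __llvm_mipmap extracted; mipraw stays
cs = 24322 + 176 + 1336 + 816 + 117 + 8952  # __llvm_prf_vnds excluded (bug)

mip_overhead = mip / base - 1               # about 3.2%, inside the stated 2-5%
cs_ratio = cs / base                        # about 7.5x, inside the stated 4-9x
```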

Thanks for the data. With value profile disabled, __llvm_prf_vnds section
should not be emitted -- there is a bug in the compiler.

Thinking about MIP's use case a little, it seems that it actually matches what XRay does. XRay has very low runtime overhead and can be turned on always: https://llvm.org/docs/XRay.html. Have you compared with XRay and considered using that?

David

ellis added a comment.


Repository:

rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION

https://reviews.llvm.org/D104060/new/

https://reviews.llvm.org/D104060

Thinking about MIP's use case a little, it seems that it actually matches what XRay does. XRay has very low runtime overhead and can be turned on always: https://llvm.org/docs/XRay.html. Have you compared with XRay and considered using that?

In the space we care about, any increase in binary size regresses runtime performance. Yes there are other instrumentations that provide similar features with low runtime overhead, but MIP seems to be the only one that extracts out all metadata to minimize binary size overhead.

I've collected some section size data from the fourinarow test and have more size comparisons in https://gist.github.com/ellishg/92a68cf82bfdeccd10225154425edc69#gistcomment-3778753
Please let me know if there are more flags I can use to turn off XRay features so that this is a more fair comparison.

XRay

-fxray-instrument -fxray-instrumentation-bundle=function-entry

size..text: 154802
size..data: 10468
size.xray_instr_map: 512
size..bss: 594144

Thinking about MIP's use case a little, it seems that it actually matches what XRay does. XRay has very low runtime overhead and can be turned on always: https://llvm.org/docs/XRay.html. Have you compared with XRay and considered using that?

Perhaps xray could be improved to reduce code size overhead to have feature parity with MIP?

Thanks for sharing this! Specifically re: the -fprofile-generate v. MIP comparison for code coverage, perhaps it would be more fair to compare against -finstrument-function-entry-bare. The latter only adds one call per function, whereas -fprofile-generate can add instrument a function in more than one place.

+1

Some historical context, I wrote an instrumentation little while back https://reviews.llvm.org/D74362 which adds similar instrumentation for function entry but at the IR level.

David

ellis added a comment.


Yes there are other instrumentations that provide similar features with low runtime overhead, but MIP seems to be the only one that extracts out all metadata to minimize binary size overhead.

I echo this. There are instrumentations with similar features (e.g., code coverage), but they are required to keep metadata in the final binary. The other lightweight instrumentations (without metadata) are far from complete enough to match the features that MIP provides.
The key differentiation of MIP is that the metadata (mipmap) is expressed with relative relocations, which makes it extractable on any build target and platform; it took us a non-trivial amount of time to get this right.
In theory, we could improve other existing instrumentations to extract metadata like MIP does, but this may require a significant amount of changes to the architecture and format.

MIP has achieved great size reduction for the instrumented binary. My understanding is that the savings are mainly from the following:

  1. Smaller counter size (1 byte or 4 bytes instead of 8 bytes for IR PGO)
  2. Extractable per-function metadata (mipmap). Using this technique may increase object file size (due to extra relocations), but will reduce executable size.

I don't understand (in block coverage mode) how the .text size can be reduced. I have not looked at the patch in detail, but the test case shows that the counter update is simply a 'movb 0, $counter' even in non-coverage mode, is that expected?

I think 2) is something worth introducing to IRPGO under a flag.

David

MaskRay added inline comments.Sun, Jun 13, 3:00 PM
llvm/lib/CodeGen/MIPSectionEmitter.cpp
61

I want to try out the patch but I have noticed some layering violation.

If you do a -DBUILD_SHARED_LIBS=on build, you'll get errors like

ld.lld: error: undefined symbol: llvm::AsmPrinter::createTempSymbol(llvm::Twine const&) const
>>> referenced by MIPSectionEmitter.cpp:60 (/home/maskray/llvm/llvm/lib/CodeGen/MIPSectionEmitter.cpp:60)
>>>               lib/CodeGen/CMakeFiles/LLVMCodeGen.dir/MIPSectionEmitter.cpp.o:(llvm::MIPSectionEmitter::runOnMachineFunctionStart(llvm::MachineFunction&))
>>> referenced by MIPSectionEmitter.cpp:128 (/home/maskray/llvm/llvm/lib/CodeGen/MIPSectionEmitter.cpp:128)
>>>               lib/CodeGen/CMakeFiles/LLVMCodeGen.dir/MIPSectionEmitter.cpp.o:(llvm::MIPSectionEmitter::emitMIPHeader(llvm::MachineProfile::MIPFileType))
>>> referenced by MIPSectionEmitter.cpp:237 (/home/maskray/llvm/llvm/lib/CodeGen/MIPSectionEmitter.cpp:237)
>>>               lib/CodeGen/CMakeFiles/LLVMCodeGen.dir/MIPSectionEmitter.cpp.o:(llvm::MIPSectionEmitter::emitMIPFunctionInfo(llvm::MIPSectionEmitter::MFInfo&))

because of -Wl,-z,defs

MIP has achieved great size reduction for the instrumented binary. My understanding is that the savings are mainly from the following:

  1. Smaller counter size (1 byte or 4 bytes instead of 8 bytes for IR PGO)
  2. Extractable per-function metadata (mipmap). Using this technique may increase object file size (due to extra relocations), but will reduce executable size.

That is correct. For function coverage MIP injects only one movb instruction on x86, so the total overhead for every function is 1 byte of global data + 7 bytes of text.

I don't understand (in block coverage mode) how the .text size can be reduced. I have not looked at the patch in detail, but the test case shows that the counter update is simply a 'movb 0, $counter' even in non-coverage mode, is that expected?

Let me try to clear something up. The size tests I ran in https://gist.github.com/ellishg/92a68cf82bfdeccd10225154425edc69 were for function coverage only and block coverage disabled. For block coverage, we inject a single movb instruction and another global byte for each basic block so the size overhead is very similar to function coverage. Can you point me to the test case you are referring to?

I think 2) is something worth introducing to IRPGO under a flag.

David

ellis updated this revision to Diff 352006.Mon, Jun 14, 3:17 PM

Fix build failures when using -DBUILD_SHARED_LIBS=on. Thanks @MaskRay

MaskRay added a comment.EditedMon, Jun 14, 4:33 PM

I have played with the patch.

-fmachine-profile-generate only inserts a __llvm_mip_call_counts_caller function call. There is no basic block instrumentation, so this is just the function entry count coverage mode.

-fmachine-profile-generate -fmachine-profile-function-coverage changes the __llvm_mip_call_counts_caller call to movb $0, counter(%rip).
So this matches the traditional binary coverage mode.
This mode is supported by clang -fsanitize-coverage=func,inline-bool-flag,pc-table.
inline-bool-flag uses a conditional set because IIUC under concurrency this is faster than a racy write.
If needed -fsanitize-coverage=inline-bool-flag can introduce a mode to use a racy write.

-fmachine-profile-generate -fmachine-profile-block-coverage inserts movb $0, counter(%rip) for machine basic blocks.
This is a vertex profile (less powerful than an edge profile).
This mode is supported by clang -fsanitize-coverage=edge,inline-bool-flag,pc-table
Mapping the information to source files will require debug info (-g1).

Traditional gcc/clang coverage features (-fprofile-arcs/-fprofile-instr-generate/-fprofile-generate) are all about edge profiles and use word-size counters.
If the size is a concern, it is probably reasonable to use 32-bit counters, but smaller counters may not be suitable for PGO.


__llvm_mip_call_counts_caller is slow.
It is a function with a custom call convention using RAX as the argument on x86-64.
The impl detail function saves and restores many vector registers.
I haven't studied why __llvm_mip_call_counts_caller is needed.


__llvm_prf_data (-fprofile-generate, -fprofile-instr-generate) vs __llvm_mipmap (-fmachine-profile-generate)

In the absence of value profiling, __llvm_prf_data uses:

.quad NameRef
.quad FuncHash
.quad .L__profc_fun
.quad fun   # need a dynamic relocation; used by raw profile reader
.quad 0     # value profiling
.long NumCounters
.long 0     # value profiling, 2 unused value sites

If we want to save size for small linked images, we can change some .quad to .long.
e.g. if the number of functions is smaller than 2**16 (or slightly larger), we can use a 32-bit hash.
.L__profc_fun can use .long if the size cannot overflow 32-bit.

Note that 2 fields are only used by value profiling.

__llvm_mipmap has these fields. I added an inline comment that -shared doesn't work.

        .section        __llvm_mipmap,"aw",@progbits
        .globl  _Z3fooPiS_$MAP
        .p2align        3
_Z3fooPiS_$MAP:
.Lref2:
  ### not sure why this is needed
        .long   __start___llvm_mipraw-.Lref2    # Raw Section Start PC Offset

  ##### this does not link in -fpic -shared mode
        .long   _Z3fooPiS_$RAW-.Lref2           # Raw Profile Symbol PC Offset

        .long   _Z3fooPiS_-.Lref2               # Function PC Offset
        .long   .Lmip_func_end0-_Z3fooPiS_      # Function Size
        .long   0x0                             # CFG Signature
        .long   0                               # Non-entry Block Count
        .long   10                              # Function Name Length
        .ascii  "_Z3fooPiS_"

So the existing instrumentations are either in Clang or LLVM IR. Sort them by their position in the compiler pipeline:

  • -fprofile-instr-generate: clang CodeGen. Very few optimizations can be applied.
  • -fprofile-arcs: very early in the optimization pipeline. The coverage information is from debug info, so instrumenting early keeps the debug info faithful.
  • -fcs-profile-generate: early in the optimization pipeline, before inlining
  • -fprofile-generate: somewhat early in the optimization pipeline, after inlining

My understanding about -fprofile-instr-generate vs -fprofile-generate.

In clang you have full information about line/column/region information.
So -fprofile-instr-generate works well for coverage.

At IR level, we can use debug info to convey some source-level information to IR but some information is lost (e.g. column information), so clang -fprofile-arcs is less accurate.

However, the frontend is not at a good position applying various optimizations.

For instance, the important Kirchhoff's circuit law (aka spanning tree) optimization is not implemented in -fprofile-instr-generate. (I added the optimization to clang -fprofile-arcs).
So in bad cases (e.g. libvpx) -fprofile-instr-generate can be 15% slower than -fprofile-arcs/-fprofile-generate.
(
As for why it is so efficient: normally the edge count |E| is only slightly larger than the vertex count |V|. There is a large difference between |E|-|V|+1 and |E|.

For instance, a no-branch function just needs one counter.
)
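The spanning-tree point can be made concrete with a toy diamond CFG (a hypothetical sketch, not LLVM code): instrument only the edges outside a spanning tree, then recover the remaining edge counts by flow conservation:

```python
# Diamond CFG: entry branches to a or b, both fall through to exit.
V = ["entry", "a", "b", "exit"]
E = [("entry", "a"), ("entry", "b"), ("a", "exit"), ("b", "exit")]

# Minimal counters for a connected CFG: |E| - |V| + 1 (here 1, not 4).
n_counters = len(E) - len(V) + 1

# Instrument just the non-tree edge entry->a; with the function entry
# count known, flow conservation recovers every other edge count.
entry_count = 10
count = {("entry", "a"): 7}
count[("entry", "b")] = entry_count - count[("entry", "a")]
count[("a", "exit")] = count[("entry", "a")]   # 'a' has one in-edge, one out-edge
count[("b", "exit")] = count[("entry", "b")]
```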

The loop optimization (instead of adding a counter N times, add N to it) cannot be enabled.
The benefit is relatively small, though.

The frontend cannot apply inlining or some early optimizations to greatly decrease the number of counters.

Now this patch series adds machine basic blocks instrumentation.
I wonder what it can do while the regular IR instrumentation cannot.

Machine basic block instrumentation has some awkward points.
Semantic information is lost. The loop optimization definitely cannot be applied.
If an IR basic block maps to multiple machine basic blocks, you need multiple increments for each MBB while with IR BB you may just need one (e.g. dominator).
Edge profiling is tricky. Edge profiling requires splitting critical edges - it is not clear how you can do this after the machine basic block layout is finalized.
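For reference, a critical edge runs from a block with several successors to a block with several predecessors; counting it exactly requires inserting a new block on the edge, which is what becomes hard after block layout is final. A small sketch (hypothetical CFG, not MIP code) that finds such edges:

```python
# Successor lists for a small CFG; C->D is the only critical edge because
# C has two successors and D has two predecessors.
succs = {"A": ["B", "C"], "B": ["D"], "C": ["D", "E"], "D": [], "E": []}

# Invert the successor map to get predecessor lists.
preds = {}
for src, outs in succs.items():
    for dst in outs:
        preds.setdefault(dst, []).append(src)

critical = [(s, d)
            for s, outs in succs.items()
            for d in outs
            if len(outs) > 1 and len(preds[d]) > 1]
```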

Good summary.

For instance, the important Kirchhoff's circuit law (aka spanning tree) optimization is not implemented. (I added the optimization to clang -fprofile-generate).

You probably meant -fprofile-instr-generate :)

So in bad cases (e.g. libvpx) -fprofile-instr-generate can be 15% slower
than -fprofile-arcs/-fprofile-generate.

The loop optimization (instead of adding a counter N times, add N to it)
cannot be enabled.
The benefit is relatively small, though.

The frontend cannot apply inlining or some early optimizations to greatly
decrease the number of counters.

Instrumenting machine basic blocks feels awkward to me.
Now, much semantic information is lost. The loop optimization definitely cannot be applied.
Edge profiling is tricky. Edge profiling requires splitting critical edges - it is not clear how you can do this after the machine basic block layout is finalized.

David


Good summary.

For instance, the important Kirchhoff's circuit law (aka spanning tree) optimization is not implemented. (I added the optimization to clang -fprofile-generate).

You probably meant -fprofile-instr-generate :)

Thanks for noticing the typo:) Corrected. @xur implemented the optimization for -fprofile-generate (D12781). I ported it to -fprofile-arcs.
-fprofile-instr-generate doesn't have the optimization.

ellis added a comment.EditedWed, Jun 16, 4:16 PM

I've created a gist comment to help explain some of the unique implementation details and discuss reasons for adding a new framework instead of extending some existing PGI framework.

https://gist.github.com/ellishg/46f04bc7761e64841653f0bcad64a1aa
(The gist has been moved to a comment so people can easily add comments)

We can also take discussions offline if you have more questions. You can email me (ellis.sparky.hoag@gmail.com) and @kyulee (kyulee@fb.com) to setup a VC.

CC @MaskRay @davidxl

ellis added a comment.Wed, Jun 16, 5:09 PM

Section Layout

MIP's major feature is the ability to extract all metadata from the instrumented binary to reduce its size. This is possible by using two sections: __llvm_mipraw to store profile data and __llvm_mipmap to store function metadata. The map section is extracted from the binary before runtime, and the raw section is dumped at runtime. The two sections are then combined into a final .mip profile. Each section has a 24-byte header and a block of data for each instrumented function.

__llvm_mipraw

The contents of this section depend on the type of instrumentation used, but the data is always "unreadable" without the map section. For function coverage, we simply allocate a single byte for each instrumented function and initialize it to all ones.

_Z3foov$RAW:
  .byte  0xff
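A minimal sketch of how the extracted raw bytes could be interpreted. The convention assumed here — that the instrumented entry point clears its byte, so a remaining 0xff means "never executed" — follows naturally from the all-ones initialization above but is an assumption, not something stated in the patch:

```python
# Hypothetical interpretation of the function-coverage raw section:
# one byte per instrumented function, initialized to 0xff.
# Assumption: the instrumented entry clears its byte on execution.

def covered_functions(raw_bytes: bytes) -> list:
    """Return indices of functions whose coverage byte was touched."""
    return [i for i, b in enumerate(raw_bytes) if b != 0xFF]

# Three instrumented functions; only the middle one ran:
print(covered_functions(bytes([0xFF, 0x00, 0xFF])))  # → [1]
```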

__llvm_mipmap

The map section allows us to map instrumented functions to their profile data in the raw section. The two most important values are the function name and the offset to the profile data in the raw section. With these values we can read a function's profile data using the offset.

_Z3foov$MAP:
.Lref:
  .long  __start___llvm_mipraw-.Lref     # Raw Section Start PC Offset
  .long  _Z3foov$RAW-.Lref               # Raw Profile Symbol PC Offset
  .long  _Z3foov-.Lref                   # Function PC Offset
  .long  [[FOO_END]]-_Z3foov             # Function Size
  .long  0x70c9fa27                      # CFG Signature
  .long  2                               # Non-entry Block Count
  .long  [[FOO_BLOCK0]]-_Z3foov          # Block 0 Offset
  .long  [[FOO_BLOCK1]]-_Z3foov          # Block 1 Offset
  .long  7                               # Function Name Length
  .ascii  "_Z3foov"

The main challenge is storing the offset to the profile data without using dynamic relocations. This is complicated by the fact that we use comdat sections within the __llvm_mipraw section and that ELF does not seem to directly support section relative addresses. The solution is to use PC relative relocations. __start___llvm_mipraw-.Lref gives us the PC relative offset to the start of the raw section and _Z3foov$RAW-.Lref gives us the offset to the profile data for this function relative to the same PC. After we extract the map section, we can subtract these to get the value we want, the section relative raw profile data offset.

(_Z3foov$RAW-.Lref) - (__start___llvm_mipraw-.Lref) = _Z3foov$RAW - __start___llvm_mipraw
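The cancellation above can be sketched in a few lines (the addresses below are hypothetical; the point is only that the shared .Lref term drops out, so no dynamic relocation is needed):

```python
# Both map fields are PC-relative to the same .Lref, so subtracting
# them cancels .Lref and leaves a purely section-relative offset.

def section_relative_offset(raw_symbol_minus_ref: int,
                            raw_start_minus_ref: int) -> int:
    # (_Z3foov$RAW - .Lref) - (__start___llvm_mipraw - .Lref)
    #     = _Z3foov$RAW - __start___llvm_mipraw
    return raw_symbol_minus_ref - raw_start_minus_ref

# Hypothetical layout: raw section starts at 0x1000, this function's
# profile byte sits at 0x1003, and .Lref is at 0x2000.
assert section_relative_offset(0x1003 - 0x2000, 0x1000 - 0x2000) == 3
```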

We can use the same trick to encode the function address; we just need to also add the address of the raw section, which can be looked up in the binary. This is useful for looking up debug info and joining it with our final profile data.

The other values are relatively straightforward. The function size and block offsets allow us to map block coverage data to debug info. The control flow graph signature is used to verify that the function has not changed since the profile was collected. Lastly, we have the mangled function name to help identify the function.
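As an illustration, here is a hedged sketch of decoding one $MAP entry per the field layout shown above. Field widths are taken from the .long/.ascii directives; any alignment padding between entries is ignored, which is an assumption about the on-disk layout:

```python
import struct

# Decode one $MAP entry following the directives shown earlier.

def parse_map_entry(data: bytes, off: int = 0) -> dict:
    (raw_start, raw_sym, func_pc, func_size,
     cfg_sig, n_blocks) = struct.unpack_from('<6i', data, off)
    off += 24
    block_offsets = list(struct.unpack_from('<%di' % n_blocks, data, off))
    off += 4 * n_blocks
    (name_len,) = struct.unpack_from('<i', data, off)
    off += 4
    name = data[off:off + name_len].decode()
    return {
        'raw_offset': raw_sym - raw_start,  # section-relative (see above)
        'func_size': func_size,
        'cfg_signature': cfg_sig & 0xFFFFFFFF,
        'block_offsets': block_offsets,
        'name': name,
    }
```

For example, an entry packed with a raw-start offset of 16 and a raw-symbol offset of 32 decodes to a section-relative profile offset of 16.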

Integration with Existing PGI?

There seems to be a question about why we chose to implement MIP from scratch instead of extending an existing framework. If we were to extend -fprofile-instr-generate we would need to have extractable metadata, which may be too invasive to implement.

  • As shown above, a lot of work was done to make sure the metadata can be extracted correctly.
  • Existing pgi has structured raw data that would need to be moved to the extractable metadata section.
  • Our MIP tool has a llvm-mipdata create command which converts the map section to an “empty” profile that can later be merged with raw profiles. Existing pgi tools do not have this extra step.

MIP Edge Profile

We omitted the code that adds call edge profiles to reduce the complexity of the current review, but we do plan to upload it. For each instrumented function, we sample return address values so we can build a dynamic call graph from the profile. The format is largely the same, but we added a size-configurable static buffer to hold the return address values. This approach correctly profiles dynamic dispatch calls that are found in Objective-C and Swift as well as any normal function calls. We can, for example, identify candidates for objc_direct using this call edge data.

Optimization

We also omitted the profile consumption and optimization code from the current review. Our focus for optimization is on outlining to reduce size and function ordering to reduce page faults. When we consume profile data with function coverage, block coverage, or function call counts we can make smarter inlining and outlining decisions based on hotness. When we have profile data with timestamps, call counts, or the dynamic call graph then we can generate an optimal function order that reduces the number of page faults.
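As a toy illustration of the ordering idea (not MIP's actual algorithm): given per-function first-call timestamps, laying functions out in call order keeps startup code on contiguous pages. The timestamp-0 convention for never-called functions is an assumption for this sketch:

```python
# Order functions by first-call timestamp; never-called functions
# (timestamp 0 here, an assumed convention) go last so hot startup
# code packs together and touches fewer pages.

def order_by_first_call(timestamps: dict) -> list:
    called = [f for f, t in timestamps.items() if t > 0]
    uncalled = [f for f, t in timestamps.items() if t == 0]
    return sorted(called, key=lambda f: timestamps[f]) + sorted(uncalled)

print(order_by_first_call({'main': 1, 'init': 2, 'cold': 0, 'draw': 5}))
# → ['main', 'init', 'draw', 'cold']
```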

MaskRay added a comment.EditedWed, Jun 16, 5:32 PM

I think people's main question is what distinguishing features make MachineFunction instrumentation appealing.

MIP Edge Profile, Optimization

The two are very inconvenient at the MachineFunction/MachineBasicBlock level...
I don't know how you can make edge profiling work for BB transitions...

I changed clang -fprofile-arcs to use critical edge splitting instead of PHI nodes and noticed some code generation improvement.
I know this is difficult/infeasible if you cannot split machine basic blocks...

For the profile format, there is indeed a bit redundancy in -fprofile-generate/-fprofile-instr-generate.
Some fields are reserved even if value profiling is not used. I do not have a good idea how we can save the space for coverage usage.
Some fields are 64-bit for generality. As I mentioned, a 32-bit CFG signature makes it less robust when the number of functions exceeds roughly 2**16.
The 32-bit Function PC Offset is probably sufficient for most usage but will not work with medium/large code model programs.

I have been slowly trying to make -fprofile-generate/-fprofile-instr-generate object files/linked images smaller (e.g. D103372) since last year (I have much to learn..).
(I can test on Linux and Windows, so I'll try making both work. I don't have Mach-O but I am happy to report whatever issues I have found, though.)
I do plan to try PC-relative relocations (I made such improvement for XRay: D78082/D78590/D87977; the only portability issue is that we will require the integrated assembler for mips64)
and probably make the symbol in __llvm_prf_data local alias to avoid an R_*_RELATIVE dynamic relocation.
(I need to study more about llvm-profdata.)

ellis added a comment.Wed, Jun 16, 6:04 PM

__llvm_mip_call_counts_caller is slow.
It is a function with a custom call convention using RAX as the argument on x86-64.
The impl detail function saves and restores many vector registers.
I haven't studied why __llvm_mip_call_counts_caller is needed.

Yes, __llvm_mip_call_counts_caller is not optimal, but we wanted to first have correctness. Since we are injecting calls to the runtime at the very beginning of functions, we save/restore the stack frame in __llvm_mip_call_counts_caller. In our return address instrumentation code, we also use this helper function to pass the return address register to the runtime.

__llvm_mipmap has these fields. I added an inline comment that -shared doesn't work.

Unfortunately, yes, it seems -shared does not work, but I don't know enough about it to have ideas for fixes at the moment.

        .section        __llvm_mipmap,"aw",@progbits
        .globl  _Z3fooPiS_$MAP
        .p2align        3
_Z3fooPiS_$MAP:
.Lref2:
  ### not sure why this is needed
        .long   __start___llvm_mipraw-.Lref2    # Raw Section Start PC Offset

  ##### this does not link in -fpic -shared mode
        .long   _Z3fooPiS_$RAW-.Lref2           # Raw Profile Symbol PC Offset

        .long   _Z3fooPiS_-.Lref2               # Function PC Offset
        .long   .Lmip_func_end0-_Z3fooPiS_      # Function Size
        .long   0x0                             # CFG Signature
        .long   0                               # Non-entry Block Count
        .long   10                              # Function Name Length
        .ascii  "_Z3fooPiS_"

In the previous comment I describe these fields in detail.

Now this patch series adds machine basic blocks instrumentation.
I wonder what it can do while the regular IR instrumentation cannot.

Machine basic block instrumentation has some awkward points.
Semantic information is lost. The loop optimization definitely cannot be applied.
If an IR basic block maps to multiple machine basic blocks, you need multiple increments for each MBB while with IR BB you may just need one (e.g. dominator).
Edge profiling is tricky. Edge profiling requires splitting critical edges - it is not clear how you can do this after the machine basic block layout is finalized.

The benefit of instrumenting machine basic blocks is that we can easily mark MBBs that were not executed as candidates for outlining. We can definitely apply Kirchhoff's circuit law optimization to reduce the number of stores.

ellis added a comment.Wed, Jun 16, 6:08 PM

I have been slowly trying to make -fprofile-generate/-fprofile-instr-generate object files/linked images smaller (e.g. D103372) since last year (I have much to learn..).
(I can test on Linux and Windows, so I'll try making both work. I don't have Mach-O but I am happy to report whatever issues I have found, though.)
I do plan to try PC-relative relocations (I made such improvement for XRay: D78082/D78590/D87977; the only portability issue is that we will require the integrated assembler for mips64)
and probably make the symbol in __llvm_prf_data local alias to avoid an R_*_RELATIVE dynamic relocation.
(I need to study more about llvm-profdata.)

I'm really happy to see this work! I also have much to learn so I'll try to keep an eye out for related diffs in the future.

ellis added a comment.Wed, Jun 16, 6:17 PM

For the profile format, there is indeed a bit redundancy in -fprofile-generate/-fprofile-instr-generate.
Some fields are reserved even if value profiling is not used. I do not have a good idea how we can save the space for coverage usage.
Some fields are 64-bit for generality. As I mentioned, a 32-bit CFG signature makes it less robust when the number of functions exceeds roughly 2**16.

Actually, the 32-bit CFG signature we use in MIP is not unique to functions. If two functions have the same basic block layout, they will have the same CFG signature and that is not a problem. The field is used to determine if a function has changed from when it was profiled. I'm not sure if this is different from the other profile formats.
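A sketch of how such a signature could be used. This is illustrative only — the actual hash MIP computes is not shown in this review; the point is that any stable 32-bit hash of the block layout supports change detection, and identical layouts may legitimately collide:

```python
import zlib

# zlib.crc32 is a stand-in here, not MIP's actual signature function.
# The signature changes when the function's block layout changes, so
# stale profiles can be rejected at merge time.

def cfg_signature(block_offsets: list, func_size: int) -> int:
    payload = repr((func_size, block_offsets)).encode()
    return zlib.crc32(payload) & 0xFFFFFFFF

old = cfg_signature([8, 24], 64)
assert cfg_signature([8, 24], 64) == old   # unchanged function: match
assert cfg_signature([8, 32], 64) != old   # block layout changed: reject
```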

The 32-bit Function PC Offset is probably sufficient for most usage but will not work with medium/large code model programs.

That is true, but we wanted to use 32-bit values to maintain consistency with armv7 targets. We could probably add a flag to support 64-bit values if we need to.

ellis updated this revision to Diff 352597.Wed, Jun 16, 6:37 PM

The __llvm_mipmap section should not have the SHF_ALLOC bit set.

I think people's main question is what distinguishing features make MachineFunction instrumentation appealing.

MIP Edge Profile, Optimization

The two are very inconvenient at the MachineFunction/MachineBasicBlock level...
I don't know how you can make edge profiling work for BB transitions...

MIP does not (cannot) collect BB edge data, but it collects machine basic block coverage as needed (optional).
But, MIP can collect call-edge data for all call-sites including dynamic dispatch calls that are not covered by LLVM IR instrumentation.
As commented earlier, MIP is initially designed for mobile applications where the majority of calls are dynamic. In this world, size or minimum-size optimizations are typically enabled.
So, traditional speed optimizations like inlining or vectorization from BB edge profiles were not a great concern.
Instead, the MIP data were mainly used for ordering, separation, or outlining to minimize CPU penalties while saving as much size as possible.

We understand that if we want an IR level profile (e.g. BB edge profiles) for IR level optimizations, it would be tricky because MIP instruments Machine IR.
However, internally we've already experimented with a SamplePGO-like conversion to generate an LLVM IR profile from MIP using symbolication.
Certainly this will lose precision, but it's generally good enough in this domain because the majority of speed optimizations will be blocked anyhow under minimum size-opt.

Nonetheless, I think it's also worth revisiting MIP-like implementation at IR level to support full IR profiles including BB edge profile by reusing as much LLVM IR instrumentation code as possible.
I do still think refactoring the profile format of the existing LLVM IR instrumentation to integrate MIP would be too disruptive, breaking the existing infra or usage.
Instead, I'd like to keep the mipmap (metadata) layout in MC to make it extractable, and this means we may retain pseudo ops all the way down to MC from IR.

ellis updated this revision to Diff 352871.Thu, Jun 17, 4:33 PM

Raw symbols should have hidden visibility so that -fPIC -shared works

Some email conversations are not on Phabricator. I record a copy here so that people who are not subscribed can have a full view.

davidxl: I believe you mean -fprofile-generate or -fcs-profile-generate. -fprofile-instr-generate is based on front end AST and eventually will be hidden under -fcoverage-test for source coverage purposes only.
davidxl: As you can see we are currently making an effort to unify/simplify the PGO implementation. Having yet another instrumentation mechanism is doing the opposite.

As shown above, a lot of work was done to make sure the metadata can be extracted correctly.
Existing pgi has structured raw data that would need to be moved to the extractable metadata section.
Our MIP tool has a llvm-mipdata create command which converts the map section to an “empty” profile that can later be merged with raw profiles. Existing pgi tools do not have this extra step.

davidxl: As I said, size improvement efforts (under options) are welcome for the existing IRPGO. Another benefit is that we can have consolidated effort on improving related toolings.

MIP Edge Profile

davidxl: Adding this duplicate functionality (edge profiling) makes it even less compelling to do it MIR level.
davidxl: For the dynamic dispatching profiling, does it handle any number of indirect targets or only supports topN?

Optimization

davidxl: See above, adding any missing features in the existing framework is the preferred approach. I have yet to see convincing arguments that spinning off a new instrumentation framework is the way to go.


maskray: I think people's main question is what distinguishing features make MachineFunction instrumentation appealing.

maskray: > MIP Edge Profile, Optimization

maskray: The two are very inconvenient at the MachineFunction/MachineBasicBlock level...
maskray: I don't know how you can make edge profiling work for BB transitions...

kyulee: MIP does not (cannot) collect BB edge data but MachineBlock coverage as needed (optional).
kyulee: But, MIP can collect call-edge data for all call-sites including dynamic dispatch calls that are not covered by LLVM IR instrumentation.

davidxl: IR instrumentation supports indirect call target profiling. I suppose MIP has a lightweight mechanism at the cost of tracking precision? Anyway, I don't think this is something IR instrumentation cannot have.

kyulee: As commented earlier, MIP is initially designed for mobile applications where majority calls are dynamic. In this world, size or minimum size optimizations are typically enabled.
kyulee: So, traditional speed optimizations like inlining or vectorization from BB edge profiles were not a great concern.
kyulee: Instead, the MIP data were mainly used for ordering, separation, or outlining to minimize CPU penalties while saving as much size as possible.

davidxl: Edge profiling helps size optimization as well -- we recently added OptimizeForSize support at BB level so that cold blocks can be better size optimized.
davidxl:
davidxl: Another plug -- if you are interested in size optimization, the ML based size optimization is also available in LLVM -- it beats -Oz.

kyulee: We understand that if we want an IR level profile (e.g. BB edge profiles) for IR level optimizations, it would be tricky because MIP instruments Machine IR.

davidxl: -fcs-profile-generate is very late in the IR pipeline after inlining transformations, so there is very little information loss when passing to MIR.

kyulee: However, internally we've already experimented with a SamplePGO-like conversion to generate an LLVM IR profile from MIP using symbolication.
kyulee: Certainly this will lose precision, but it's generally good enough in this domain because the majority of speed optimizations will be blocked anyhow under minimum size-opt.

davidxl: Do you mean 'converting MIP' for IR passes?

kyulee: Nonetheless, I think it's also worth revisiting MIP-like implementation at IR level to support full IR profiles including BB edge profile by reusing as much LLVM IR instrumentation code as possible.

davidxl: yes.

kyulee: I do still think refactoring the profile format of the existing LLVM IR instrumentation to integrate MIP would be too disruptive, breaking the existing infra or usage.

davidxl: It is fine to do this under an option and produce profile data with different magic number or flavor bit -- this is well supported.


kyulee: I understand IR instrumentation has value-profiling for indirect call targets at call-sites. I don’t think IR instrumentation covers dynamic dispatch call-sites like msgSend, whose target resolution happens at runtime. Instead of instrumenting each call-site, MIP tracks return address values at function entry via a sort of back-tracking to reconstruct call-edges regardless of call type – direct, indirect, or dynamic.
kyulee: MIP does this instrumentation at post-RA where assembly level coding is relatively straightforward. I think doing this at IR before frame-lowering would need extra overhead/mechanics to ensure this instrumentation happens at the very beginning of the function.

davidxl: Do you mean 'converting MIP' for IR passes?

kyulee: Yes. MIP is not IR-attached, but rather tagged on machine address with which we can easily correlate debug data.
kyulee: So, it’s possible to construct a SamplePGO profile that is consumable for IR passes.

davidxl: Basically the callsite context (counter) is passed to the caller so it can do the profiling. GCC does that too.
davidxl: For IRPGO, we plan to add dynamic type profiling at some point. Once that is ready, the problem of message passing style call profiling will be handled.
davidxl: Also if edge profiling is available, profiling direct calls will be a waste as the information can be deduced.

MaskRay added a comment.EditedFri, Jun 18, 10:06 AM

@ellis

The main challenge is storing the offset to the profile data without using dynamic relocations. This is complicated by the fact that we use comdat sections within the __llvm_mipraw section and that ELF does not seem to directly support section relative addresses. The solution is to use PC relative relocations. __start___llvm_mipraw-.Lref gives us the PC relative offset to the start of the raw section and _Z3foov$RAW-.Lref gives us the offset to the profile data for this function relative to the same PC. After we extract the map section, we can subtract these to get the value we want, the section relative raw profile data offset.

(_Z3foov$RAW-.Lref) - (__start___llvm_mipraw-.Lref) = _Z3foov$RAW - __start___llvm_mipraw

I am unclear about this. Why does the $MAP section need to know its relative position?

Note: if the $RAW symbol has a local linkage or has the hidden visibility, a label difference can indeed avoid a dynamic relative relocation. I have a patch D104556

# ELF: R_X86_64_PC64
.quad .L__profc_foo-.L__profd_foo

# Mach-O: a pair of X86_64_RELOC_UNSIGNED and X86_64_RELOC_SUBTRACTOR
.quad l___profc_foo-l___profd_foo

Unfortunately this is currently not representable in COFF:

% clang -fprofile-generate a.c -c -target x86_64-windows -o d.o
error: Cannot represent this expression

We can use the same trick to encode the function address, we just need to also add the address of the raw section which can be looked up in the binary. This is useful to lookup debug info and join it with our final profile data.

A __profd_$name variable has a similar field referencing the function. It is used by IPVK_IndirectCallTarget so that indirect call target profile can be translated to function names.

We can save the dynamic relocation with the following scheme:

; Note the sub constexpr
@__profd_main = private global { i64, i64, i64*, i64, i8*, i32, [2 x i16] } { i64 -2624081020897602054, i64 742261418966908927, i64* getelementptr inbounds ([1 x i64], [1 x i64]* @__profc_main, i32 0, i32 0), i64 sub (i64 ptrtoint (i32 ()* @alias_main to i64), i64 ptrtoint ({ i64, i64, i64*, i64, i8*, i32, [2 x i16] }* @__profd_main to i64)), i8* null, i32 1, [2 x i16] zeroinitializer }, section "__llvm_prf_data", comdat($__profc_main), align 8

@alias_main = private alias i32 (), i32 ()* @main

define i32 @main() #0 {
  ...

For other fields of __profd_$name, other than the generality and mandatory value profiling (even if you don't use it) issues I mentioned in https://reviews.llvm.org/D104060#2818268 ,
I think the $MAP format is the same.

The $MAP format does not implement these things:

  • using private linkage as much as possible
  • ELF comdat any/noduplicates (this increases object file sizes but can enable --gc-sections for linked images)
  • function name compression
ellis marked an inline comment as done.Fri, Jun 18, 10:50 AM

@davidxl

davidxl: For the dynamic dispatching profiling, does it handle any number of indirect targets or only supports topN?

Our edge profiling patch (not included) was also designed with code size and runtime in mind. We allocate a buffer and sample return addresses, possibly overwriting old values when the buffer is full. Data for common callsites will overwrite data for rare callsites, so I guess it's more like top N, but the size of the buffer can be as large as needed.
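A rough sketch of that sampling scheme. The eviction policy here is random slot overwrite, which is an assumption for illustration — the review does not specify the actual replacement policy:

```python
import random

# Fixed-size buffer of sampled return addresses; recording may
# overwrite an old sample, so frequent callsites statistically
# dominate, approximating a "top N" of incoming edges.

class ReturnAddressBuffer:
    def __init__(self, size: int, seed: int = 0) -> None:
        self.slots = [None] * size
        self._rng = random.Random(seed)

    def record(self, return_address: int) -> None:
        # Constant space, no locks: overwrite an arbitrary slot.
        self.slots[self._rng.randrange(len(self.slots))] = return_address

    def observed_callsites(self) -> set:
        return {a for a in self.slots if a is not None}

buf = ReturnAddressBuffer(size=4)
for _ in range(100):
    buf.record(0x1000)       # hot callsite keeps re-recording itself
buf.record(0x2000)           # rare callsite may later be evicted
assert buf.observed_callsites() <= {0x1000, 0x2000}
```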

davidxl: Basically the callsite context (counter) is passed to the caller so it can do the profiling. GCC does that too.

Our method does not need to touch the callsite code to work. In fact, most of the implementation is done in D104089 so we mostly only change the runtime code.

davidxl: For IRPGO, we plan to add dynamic type profiling at some point. Once that is ready, the problem of message passing style call profiling will be handled.
davidxl: Also if edge profiling is available, profiling direct calls will be a waste as the information can be deduced.

@davidxl

davidxl: For the dynamic dispatching profiling, does it handle any number of indirect targets or only supports topN?

Our edge profiling patch (not included) was also designed with code size and runtime in mind. We allocate a buffer and sample return addresses, possibly overwriting old values when the buffer is full. Data for common callsites will overwrite data for rare callsites, so I guess it's more like top N, but the size of the buffer can be as large as needed.

Neat -- so the value profile (target histogram) of a given callsite is 'distributed' into the callees -- aka each callee function allocating a fixed size buffer to track all incoming edge frequencies? For some small utility functions, they usually have thousands of callsites (incoming edges), thus this approach may significantly reduce the profile precision for them? The simple LRU eviction policy may be bad for some patterns like ping-pong effect.

davidxl: Basically the callsite context (counter) is passed to the caller so it can do the profiling. GCC does that too.

Our method does not need to touch the callsite code to work. In fact, most of the implementation is done in D104089 so we mostly only change the runtime code.

Will take a look.

davidxl: For IRPGO, we plan to add dynamic type profiling at some point. Once that is ready, the problem of message passing style call profiling will be handled.
davidxl: Also if edge profiling is available, profiling direct calls will be a waste as the information can be deduced.

@ellis

The main challenge is storing the offset to the profile data without using dynamic relocations. This is complicated by the fact that we use comdat sections within the __llvm_mipraw section and that ELF does not seem to directly support section relative addresses. The solution is to use PC relative relocations. __start___llvm_mipraw-.Lref gives us the PC relative offset to the start of the raw section and _Z3foov$RAW-.Lref gives us the offset to the profile data for this function relative to the same PC. After we extract the map section, we can subtract these to get the value we want, the section relative raw profile data offset.

(_Z3foov$RAW-.Lref) - (__start___llvm_mipraw-.Lref) = _Z3foov$RAW - __start___llvm_mipraw

I am unclear about this. Why does the $MAP section need to know its relative position?

The map section doesn't care about its relative position, but it is used to compute the value we want. Ideally we would do something simple like this to get the section relative address of the symbol.

_Z3foov$RAW-__start___llvm_mipraw

From my testing this doesn't work because these symbols are in different sections in ELF (because we use comdat sections for the header). Also, IIRC there were other issues due to relocations getting resolved before the final executable was created and ending up with the wrong values. The solution in this patch was the only one that seems to work in all cases.

Note: if the $RAW symbol has a local linkage or has the hidden visibility, a label difference can indeed avoid a dynamic relative relocation.

# ELF: R_X86_64_PC64
.quad .L__profc_foo-.L__profd_foo

# Mach-O: a pair of X86_64_RELOC_UNSIGNED and X86_64_RELOC_SUBTRACTOR
.quad l___profc_foo-l___profd_foo

Unfortunately this may not be representable in COFF:

% clang -fprofile-generate a.c -c -target x86_64-windows -o d.o
error: Cannot represent this expression

I haven't tested COFF, but I think it might support section relative addresses, which would make this format much simpler.

We can use the same trick to encode the function address, we just need to also add the address of the raw section which can be looked up in the binary. This is useful to lookup debug info and join it with our final profile data.

A __profd_$name variable has a similar field referencing the function. It is used by IPVK_IndirectCallTarget so that indirect call target profile can be translated to function names.

We can save the dynamic relocation with the following scheme:

; Note the sub constexpr
@__profd_main = private global { i64, i64, i64*, i64, i8*, i32, [2 x i16] } { i64 -2624081020897602054, i64 742261418966908927, i64* getelementptr inbounds ([1 x i64], [1 x i64]* @__profc_main, i32 0, i32 0), i64 sub (i64 ptrtoint (i32 ()* @alias_main to i64), i64 ptrtoint ({ i64, i64, i64*, i64, i8*, i32, [2 x i16] }* @__profd_main to i64)), i8* null, i32 1, [2 x i16] zeroinitializer }, section "__llvm_prf_data", comdat($__profc_main), align 8

@alias_main = private alias i32 (), i32 ()* @main

define i32 @main() #0 {
  ...

For other fields of __profd_$name, other than the generality and mandatory value profiling (even if you don't use it) issues I mentioned in https://reviews.llvm.org/D104060#2818268 ,
I think the $MAP format is the same.

The format is probably the same. I'm interested to see if we can do something similar to replace our encoded function address.

The $MAP format does not implement these things:

  • using private linkage as much as possible

I think we can use private linkage in more places if we need to.

  • ELF comdat any/noduplicates (this increases object file sizes but can enable --gc-sections for linked images)
  • function name compression

We don't care about the size of the map section since it is extracted out.

I probably sound like a broken record, but we've spent a lot of time making sure the map section can be extracted correctly and that the raw section has no excess info. This is a major feature of MIP: the profile data and the function info are separated so that only the necessary data remains in the binary. The question is whether we should extend an existing pgi to support this feature or whether MIP deserves to be its own framework. I do see the value in extending one of the many existing pgi frameworks to reduce duplicate work. My thoughts are that it would be too invasive to do everything we need: add one or two new sections, create new .profraw and .profmap formats, add a few flags, and extend the tools. By keeping MIP separate, we can make design decisions that align with our code size goal that may not be so easy to make in existing frameworks.

Neat -- so the value profile (target histogram) of a given callsite is 'distributed' into the callees -- aka each callee function allocating a fixed size buffer to track all incoming edge frequencies? For some small utility functions, they usually have thousands of callsites (incoming edges), thus this approach may significantly reduce the profile precision for them? The simple LRU eviction policy may be bad for some patterns like ping-pong effect.

Our current approach is to use a single global buffer for all callees, and we store a value that identifies both a callsite address and the callee. We've considered other options like a buffer for each callee, but we settled on the global buffer approach to avoid locks, extra complexity, and runtime overhead. Yes, there are cases where we oversample some edges. I'd say this patch is still in the experimental stage, so there is still work to be done.
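A hypothetical encoding sketch of the "one value identifies both callsite and callee" idea. The 24-bit callee field and the field order are illustrative assumptions, not the patch's actual format:

```python
# Pack a callee id and a callsite (return address) into one integer so
# a single lock-free global buffer can record call edges.

CALLEE_BITS = 24  # assumed split for illustration

def pack_edge(callee_id: int, return_address: int) -> int:
    assert 0 <= callee_id < (1 << CALLEE_BITS)
    return (return_address << CALLEE_BITS) | callee_id

def unpack_edge(value: int) -> tuple:
    return (value & ((1 << CALLEE_BITS) - 1), value >> CALLEE_BITS)

v = pack_edge(callee_id=7, return_address=0x401234)
assert unpack_edge(v) == (7, 0x401234)
```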

Neat -- so the value profile (target histogram) of a given callsite is 'distributed' into the callees -- aka each callee function allocating a fixed size buffer to track all incoming edge frequencies? For some small utility functions, they usually have thousands of callsites (incoming edges), thus this approach may significantly reduce the profile precision for them? The simple LRU eviction policy may be bad for some patterns like ping-pong effect.

Our current approach is to use a single global buffer for all callees, and we store a value that identifies both a callsite address and the callee. We've considered other options like a buffer for each callee, but we settled on the global buffer approach to avoid locks, extra complexity, and runtime overhead. Yes, there are cases where we oversample some edges. I'd say this patch is still in the experimental stage, so there is still work to be done.

A single global buffer introduces more data races, which can be worse (with or without a lock) compared with split buffers.

I probably sound like a broken record, but we've spent a lot of time making sure the map section can be extracted correctly and that the raw section has no excess info. This is a major feature of MIP: the profile data and the function info are separated so that only the necessary data remains in the binary. The question is whether we should extend an existing pgi to support this feature or whether MIP deserves to be its own framework. I do see the value in extending one of the many existing pgi frameworks to reduce duplicate work. My thoughts are that it would be too invasive to do everything we need: add one or two new sections, create new .profraw and .profmap formats, add a few flags, and extend the tools. By keeping MIP separate, we can make design decisions that align with our code size goal that may not be so easy to make in existing frameworks.

The existing __llvm_prf_cnts/__llvm_prf_data can be extracted easily as well... What functionality do you find missing?

It requires a lot of effort to make linker garbage collection work. How does MIP work better? I don't find comdat specific code or Mach-O S_ATTR_LIVE_SUPPORT.

I have a patch to use label differences (PC-relative relocations on ELF) for CountersPtr: D104556. The tricky part is making COFF work because COFF has some limitations.

Do you have measurement of metadata section size, dynamic relocation size, symbol table entry size for the __llvm_prf_* format and MIP?