Page MenuHomePhabricator

Machine IR Profile
AbandonedPublic

Authored by ellis on Jun 10 2021, 2:15 PM.

Details

Summary

Machine IR Profile (MIP)

tl;dr;

This is a proposal to introduce a new instrumentation pass that can produce optimization profiles with a focus on binary size and runtime performance of the instrumented binaries.

Our instrumented binaries record machine function call counts, machine function timestamps, machine basic block coverage, and a subset of the dynamic call graph. There is also a more lightweight mode that only collects machine function coverage data that has negligible runtime overhead and a binary size increase of 2-5% for instrumented binaries.

This is just the first patch of the WIP MIP project. The full branch can be found at https://github.com/ellishg/llvm-project

Motivation

In the mobile space, increasing binary size has an outsized impact on both runtime performance and download speed. Current instrumentation implementations such as XRay and GCov produce binaries that are too slow and too large to run on real mobile devices. We propose a new pass that injects instrumentation code at the machine ir level. At runtime, we write profile data to our custom __llvm_mipraw section that is eventually dumped to a .mipraw file. At buildtime, we emit a .mipmap file which we use to map function information to data in the .mipraw file. The result is that no redundant function info is stored in the binary, which allows our instrumentation to have minimal size overhead.

MIP has been implemented on ELF and Mach-O targets for x86_64, AArch64, and Armv7 with Thumb and Thumb2.

Performance

Our focus for now is on the performance and size of binaries that have injected instrumentation instead of binaries that have been optimized with instrumentation profiles. We collected some basic results from MultiSource/Benchmarks in llvm-test-suite for both MIP and clang’s instrumentation using the -fprofile-generate flag. It should be noted that this comparison is not fair because clang’s instrumentation collects much more data than just function coverage. However, we expect fully-featured MIP to have similar metrics.

Instrumented Binary Size

At the moment, we have implemented function coverage which injects one x86_64 instruction (7 bytes) and one byte of global data for each instrumented function, which should have minimal impact on binary size and runtime performance. In fact, our results show that we should expect MIP instrumented binaries to be only 2-5% larger. We contrast this with clang’s instrumentation, which can increase the binary size by 500-900%.

Instrumented Execution Time

We found that MIP had negligable execution time regressions when instrumented with MIP. Again, we can (unfairly) contrast this to -fprofile-generate which increased execution time by 1-40%.

Usage

We use the -fmachine-profile-generate clang flag to produce an instrumented binary and then use llvm-objcopy to extract the .mipmap file.

$ clang -g -fmachine-profile-generate main.cpp
$ llvm-objcopy --dump-section=__llvm_mipmap=default.mipmap a.out /dev/null
$ llvm-strip -g a.out -o a.out.stripped

This will produce the instrumented binary a.out and a map file default.mipmap.

When we run the binary, it will produce a default.mipraw file containing the profile data for that run.

$ ./a.out.stripped
$ ls
a.out    default.mipmap    default.mipraw    main.cpp

Then we use our custom tool to postprocess the raw profile and produce the final profile default.mip.

$ llvm-mipdata create -p default.mip default.mipmap
$ llvm-mipdata merge -p default.mip default.mipraw

If our binary has debug info, we can use it to report source information along with the profile data.

$ llvm-mipdata show -p default.mip --debug a.out
_Z3fooi
  Source Info: /home/main.cpp:9
  Call Count: 0
  Block Coverage:
     COLD COLD COLD COLD COLD

_Z3bari
  Source Info: /home/main.cpp:16
  Call Count: 1
  Block Coverage:
     HOT  HOT  COLD HOT  HOT

Finally, we can consume the profile using the clang flag -fmachine-profile-use= to produce a profile-optimized binary.

$ clang -fmachine-profile-use=default.mip main.cpp

Diff Detail

Unit TestsFailed

TimeTest
30 msx64 windows > flang-Unit.RuntimeGTest/_/FlangRuntimeTests_exe::TimeIntrinsics.CpuTime
Script: -- C:\ws\w6\llvm-project\premerge-checks\build\tools\flang\unittests\RuntimeGTest\.\FlangRuntimeTests.exe --gtest_filter=TimeIntrinsics.CpuTime

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
ellis added a comment.Jun 10 2021, 7:05 PM

Can you compare the instrumentation overhead with -fcs-profile-generate? (as with -fprofile-generate). You may also want to disable value profiling (which can be expensive) in -fcs-profile-generate.

Thanks for sharing this! Specifically re: the -fprofile-generate v. MIP comparison for code coverage, perhaps it would be more fair to compare against -finstrument-function-entry-bare. The latter only adds one call per function, whereas -fprofile-generate can add instrument a function in more than one place.

@davidxl @vsk

Yes, I am planning to run some more simple benchmarks from llvm-test-suite for code size and performance of the instrumented binaries. I can compare MIP vs -finstrument-function-entry-bare, but are there any flags I should compare against?

You may compare the overhead with -fsanitize-coverage=func,inline-8bit-counters,pc-table. The latter only instruments the entry block of a function and uses a one-byte counter.

ellis added a comment.Jun 11 2021, 3:46 PM

I've collected some size and runtime metrics from llvm-test-suite. The steps and results can be found in this gist: https://gist.github.com/ellishg/92a68cf82bfdeccd10225154425edc69

Keep in mind that a major goal is that MIP instrumented binaries should be as small as possible without sacrificing usability or functionality. The metrics that I measured in the gist are the binary size and the execution time of instrumented binaries. Like I said in the description, MIP instrumented binaries are usually 2-5% larger than non-instrumented binaries. We can compare this to the -fsanitize-coverage=func,inline-bool-flag,pc-table flag which seems to produce instrumented binaries that can be 100x larger. The extra size likely comes from the function metadata that is left in the binary, whereas MIP extracts all excess metadata out of the binary.

I also looked into comparing against -finstrument-function-entry-bare and -finstrument-functions. It seems that they inject calls to void __cyg_profile_func_enter_bare(); or void __cyg_profile_func_enter (void *this_fn, void *call_site); at the start of each function. The first flag doesn't really provide enough context know the caller and mark it as covered. The second does provide this information, but it still would require some work to collect profile data in a consumable way as MIP does. That being said, these flags seem provide similar size and runtime overhead as MIP, albeit with much less functionality.

It should also be noted that all this is only considering MIP function coverage. MIP can profile much more data including call counts and the dynamic call graph, likely with very similar performance metrics.

Do you have runtime number with -fcs-profile-generate -mllvm
-disable-vp=true

This will provide more data (BB count) than what MIP can do.

Besides, more options can be added to control the type of profile data
collected in IRPGO to further reduce overhead.

David

ellis added a comment.Jun 11 2021, 4:40 PM

Do you have runtime number with -fcs-profile-generate -mllvm
-disable-vp=true

This will provide more data (BB count) than what MIP can do.

Besides, more options can be added to control the type of profile data
collected in IRPGO to further reduce overhead.

David

I've added those numbers here https://gist.github.com/ellishg/92a68cf82bfdeccd10225154425edc69#gistcomment-3777527

The flags -fcs-profile-generate -mllvm -disable-vp=true appears to increase binary size by 4-9x and execution time is increased in many cases.

ellis updated this revision to Diff 351601.Jun 11 2021, 5:28 PM

Fix clang-tidy errors

ellis added a comment.Jun 11 2021, 5:52 PM

@davidxl

Can you use the same set of benchmarks for comparison?

I'm not sure what you mean. In https://gist.github.com/ellishg/92a68cf82bfdeccd10225154425edc69 I used llvm-test-suite/MultiSource/Benchmarks/ for all tests.

  1. is MIP used in the test only for function coverage or have call counts collected as well?

Just function coverage for simplicity, but full MIP does not add very much to the size.

  1. for -fcs-profile-generate, the size increase may mostly come from the name section. Can you compare .text + .data + .bss

Sure, I can do this next week.

ellis updated this revision to Diff 351625.Jun 11 2021, 8:12 PM

MIP instrumentation pass should happen before branch relaxation

ellis added a comment.Jun 12 2021, 4:55 PM

@davidxl

I've added text size data to https://gist.github.com/ellishg/92a68cf82bfdeccd10225154425edc69#gistcomment-3778109 and hopefully it is more clear. I randomly chose the test MultiSource/Benchmarks/FreeBench/fourinarow.test to show the size data. You can see the raw data in https://gist.github.com/ellishg/156639f24d728a88067f903cb53e1643

Base

"size..text": 4629
"size..data": 8,
"size..bss": 120,

MIP

"size..text": 4741,
"size..data": 8,
"size.__llvm_mipmap": 768,
"size.__llvm_mipraw": 41,
"size..bss": 120,

-fcs-profile-generate

"size..text": 24322,
"size..data": 176,
"size.__llvm_prf_cnts": 1336,
"size.__llvm_prf_data": 816,
"size.__llvm_prf_names": 117,
"size.__llvm_prf_vnds": 24576,
"size..bss": 8952,

Thanks for the data. With value profile disabled, __llvm_prf_vnds section
should not be emitted -- there is a bug in the compiler.

Thinking about MIP's use case a little, it seems that it actually matches
what xray does. Xray has very low runtime overhead and can be turned on
always : xra https://llvm.org/docs/XRay.htmly. Have you compare with xray
and consider using that?

David

ellis added a comment.

@davidxl

I've added text size data to
https://gist.github.com/ellishg/92a68cf82bfdeccd10225154425edc69#gistcomment-3778109
and hopefully it is more clear. I randomly chose the test
MultiSource/Benchmarks/FreeBench/fourinarow.test to show the size data.
You can see the raw data in
https://gist.github.com/ellishg/156639f24d728a88067f903cb53e1643

Base

"size..text": 4629
"size..data": 8,
"size..bss": 120,

MIP

"size..text": 4741,
"size..data": 8,
"size.__llvm_mipmap": 768,
"size.__llvm_mipraw": 41,
"size..bss": 120,

-fcs-profile-generate

"size..text": 24322,
"size..data": 176,
"size.__llvm_prf_cnts": 1336,
"size.__llvm_prf_data": 816,
"size.__llvm_prf_names": 117,
"size.__llvm_prf_vnds": 24576,
"size..bss": 8952,

Repository:

rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION

https://reviews.llvm.org/D104060/new/

https://reviews.llvm.org/D104060

Thinking about MIP's use case a little, it seems that it actually matches
what xray does. Xray has very low runtime overhead and can be turned on
always : xra https://llvm.org/docs/XRay.htmly. Have you compare with xray
and consider using that?

In the space we care about, any increase in binary size regresses runtime performance. Yes there are other instrumentations that provide similar features with low runtime overhead, but MIP seems to be the only one that extracts out all metadata to minimize binary size overhead.

I've collected some section size data from the fourinarow test and have more size comparisons in https://gist.github.com/ellishg/92a68cf82bfdeccd10225154425edc69#gistcomment-3778753
Please let me know if there are more flags I can use to turn off XRay features so that this is a more fair comparison.

XRay

-fxray-instrument -fxray-instrumentation-bundle=function-entry

size..text: 154802
size..data: 10468
size.xray_instr_map: 512
size..bss: 594144

Thinking about MIP's use case a little, it seems that it actually matches
what xray does. Xray has very low runtime overhead and can be turned on
always : xra https://llvm.org/docs/XRay.htmly. Have you compare with xray
and consider using that?

Perhaps xray could be improved to reduce code size overhead to have feature parity with MIP?

Thanks for sharing this! Specifically re: the -fprofile-generate v. MIP comparison for code coverage, perhaps it would be more fair to compare against -finstrument-function-entry-bare. The latter only adds one call per function, whereas -fprofile-generate can add instrument a function in more than one place.

+1

Some historical context, I wrote an instrumentation little while back https://reviews.llvm.org/D74362 which adds similar instrumentation for function entry but at the IR level.

David

ellis added a comment.

@davidxl

I've added text size data to
https://gist.github.com/ellishg/92a68cf82bfdeccd10225154425edc69#gistcomment-3778109
and hopefully it is more clear. I randomly chose the test
MultiSource/Benchmarks/FreeBench/fourinarow.test to show the size data.
You can see the raw data in
https://gist.github.com/ellishg/156639f24d728a88067f903cb53e1643

Base

"size..text": 4629
"size..data": 8,
"size..bss": 120,

MIP

"size..text": 4741,
"size..data": 8,
"size.__llvm_mipmap": 768,
"size.__llvm_mipraw": 41,
"size..bss": 120,

-fcs-profile-generate

"size..text": 24322,
"size..data": 176,
"size.__llvm_prf_cnts": 1336,
"size.__llvm_prf_data": 816,
"size.__llvm_prf_names": 117,
"size.__llvm_prf_vnds": 24576,
"size..bss": 8952,

Repository:

rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION

https://reviews.llvm.org/D104060/new/

https://reviews.llvm.org/D104060

Yes there are other instrumentations that provide similar features with low runtime overhead, but MIP seems to be the only one that extracts out all metadata to minimize binary size overhead.

I echo this. There are instrumentations with similar features (e.g, code coverage) but they required to keep metadata in the final binary, The other/rest lightweight instrumentations (without metadata) are far from complete to match the feature that MIP provides.
The key differentiation of MIP is that the metadata (mipmap) is expressed with relative relocations, which can be extractable in any build target and platform, which took us non-trivial amount of time to make them right.
In theory, we could improve other existing insturmentations like MIP to extract metadata, but this may require a significant amount of changes in the architect and format.

MIP has achieved great size reduction for instrumented binary. My
understanding the savings are mainly from the following:

  1. Smaller counter size (1 byte or 4 byte instead of 8 byte for IR PGO)
  2. extractable per function metadata (mipmap). Using this technique may

increase object file size more (due to extra relocations), but will reduce
executable size.

I don't understand (in block coverage mode) how the .text size can be
reduced. I have not looked at the patch in detail, but the test case shows
that the counter update is simply a 'movb 0, $counter' even in non-coverage
mode, is that expected?

I think 2) is something worth introducing to IRPGO under a flag.

David

MaskRay added inline comments.Jun 13 2021, 3:00 PM
llvm/lib/CodeGen/MIPSectionEmitter.cpp
61

I want to try out the patch but I have noticed some layering violation.

If you do a -DBUILD_SHARED_LIBS=on build, you'll get errors like

ld.lld: error: undefined symbol: llvm::AsmPrinter::createTempSymbol(llvm::Twine const&) const
>>> referenced by MIPSectionEmitter.cpp:60 (/home/maskray/llvm/llvm/lib/CodeGen/MIPSectionEmitter.cpp:60)
>>>               lib/CodeGen/CMakeFiles/LLVMCodeGen.dir/MIPSectionEmitter.cpp.o:(llvm::MIPSectionEmitter::runOnMachineFunctionStart(llvm::MachineFunction&))
>>> referenced by MIPSectionEmitter.cpp:128 (/home/maskray/llvm/llvm/lib/CodeGen/MIPSectionEmitter.cpp:128)
>>>               lib/CodeGen/CMakeFiles/LLVMCodeGen.dir/MIPSectionEmitter.cpp.o:(llvm::MIPSectionEmitter::emitMIPHeader(llvm::MachineProfile::MIPFileType))
>>> referenced by MIPSectionEmitter.cpp:237 (/home/maskray/llvm/llvm/lib/CodeGen/MIPSectionEmitter.cpp:237)
>>>               lib/CodeGen/CMakeFiles/LLVMCodeGen.dir/MIPSectionEmitter.cpp.o:(llvm::MIPSectionEmitter::emitMIPFunctionInfo(llvm::MIPSectionEmitter::MFInfo&))

because of -Wl,-z,defs

MIP has achieved great size reduction for instrumented binary. My
understanding the savings are mainly from the following:

  1. Smaller counter size (1 byte or 4 byte instead of 8 byte for IR PGO)
  2. extractable per function metadata (mipmap). Using this technique may

increase object file size more (due to extra relocations), but will reduce
executable size.

That is correct. For function coverage MIP inject only one movb instruction in x86 so that the total overhead for every function is 1 byte of global data + 7 bytes of text.

I don't understand (in block coverage mode) how the .text size can be
reduced. I have not looked at the patch in detail, but the test case shows
that the counter update is simply a 'movb 0, $counter' even in non-coverage
mode, is that expected?

Let me try to clear something up. The size tests I ran in https://gist.github.com/ellishg/92a68cf82bfdeccd10225154425edc69 were for function coverage only and block coverage disabled. For block coverage, we inject a single movb instruction and another global byte for each basic block so the size overhead is very similar to function coverage. Can you point me to the test case you are referring to?

I think 2) is something worth introducing to IRPGO under a flag.

David

ellis updated this revision to Diff 352006.Jun 14 2021, 3:17 PM

Fix build failures when using -DDBUILD_SHARED_LIBS=On. Thanks @MaskRay

MaskRay added a comment.EditedJun 14 2021, 4:33 PM

I have played with the patch.

-fmachine-profile-generate only inserts a __llvm_mip_call_counts_caller function call. There is no basic block instrumentation, so this is just the function entry count coverage mode.

-fmachine-profile-generate -fmachine-profile-function-coverage changes the __llvm_mip_call_counts_caller call to movb $0, counter(%rip).
So this matches the traditional binary coverage mode.
This mode is supported by clang -fsanitize-coverage=func,inline-bool-flag,pc-table.
inline-bool-flag uses a conditional set because IIUC under concurrency this is faster than a racy write.
If needed -fsanitize-coverage=inline-bool-flag can introduce a mode to use a racy write.

-fmachine-profile-generate -fmachine-profile-block-coverage inserts movb $0, counter(%rip) for machine basic blocks.
This is a vertex profile (less powerful than a edge profile).
This mode is supported by clang -fsanitize-coverage=edge,inline-bool-flag,pc-table
Mapping the information to source files will require debug info (-g1).

Traditional gcc/clang coverage features (-fprofile-arcs/-fprofile-instr-generate/-fprofile-generate) are all about edge profiles and use word-size counters.
If the size is a concern, it is probably reasonable to use 32-bit counters, but smaller counters may not be suitable for PGO.


__llvm_mip_call_counts_caller is slow.
It is a function with a custom call convention using RAX as the argument on x86-64.
The impl detail function saves and restores many vector registers.
I haven't studied why __llvm_mip_call_counts_caller is needed.


__llvm_prf_data (-fprofile-generate, -fprofile-instr-generate) vs __llvm_mipmap (-fmachine-profile-generate)

In the absence of value profiling, __llvm_prf_data uses:

.quad NameRef
.quad FuncHash
.quad .L__profc_fun
.quad fun   # need a dynamic relocation; used by raw profile reader
.quad 0     # value profiling
.long NumCounters
.long 0     # value profiling, 2 unused value sites

If we want to save size for small linked images, we can change some .quad to .long.
e.g. if the number of functions is smaller than 2**16 (or slightly larger), we can use a 32-bit hash.
.L__profc_fun can use .long if the size cannot overflow 32-bit.

Note that 2 fields are only used by value profiling.

__llvm_mipmap has these fields. I added an inline comment that -shared doesn't work.

        .section        __llvm_mipmap,"aw",@progbits
        .globl  _Z3fooPiS_$MAP
        .p2align        3
_Z3fooPiS_$MAP:
.Lref2:
  ### not sure why this is needed
        .long   __start___llvm_mipraw-.Lref2    # Raw Section Start PC Offset

  ##### this does not link in -fpic -shared mode
        .long   _Z3fooPiS_$RAW-.Lref2           # Raw Profile Symbol PC Offset

        .long   _Z3fooPiS_-.Lref2               # Function PC Offset
        .long   .Lmip_func_end0-_Z3fooPiS_      # Function Size
        .long   0x0                             # CFG Signature
        .long   0                               # Non-entry Block Count
        .long   10                              # Function Name Length
        .ascii  "_Z3fooPiS_"

So the existing instrumentations are either in Clang or LLVM IR. Sort them by their position in the compiler pipeline:

  • -fprofile-instr-generate: clang CodeGen. Very few optimizations can be applied.
  • -fprofile-arcs: very early in the optimization pipeline. The coverage information is from debug info, so being early can have debug info with fidelity.
  • -fcs-profile-generate: in buildModuleSimplificationPipeline, no pre-inliner, before PassBuilder::buildInlinerPipeline
  • -fprofile-generate: in buildModuleSimplificationPipeline, after pre-inliner, before PassBuilder::buildInlinerPipeline; then a separate context-sensitive PGO instrumentation/use is added after inlining, early in the optimization pipeline

My understanding about -fprofile-instr-generate vs -fprofile-generate.

In clang you have full information about line/column/region information.
So -fprofile-instr-generate works well for coverage.

At IR level, we can use debug info to convey some source-level information to IR but some information is lost (e.g. column information), so clang -fprofile-arcs is less accurate.

However, the frontend is not at a good position applying various optimizations.

For instance, the important Kirchoff's circult law (aka spanning tree) optimization is not implemented in -fprofile-instr-generate. (I added the optimization to clang -fprofile-arcs).
So in bad cases (e.g. libvpx) -fprofile-instr-generate can be 15% slower than -fprofile-arcs/-fprofile-generate.
(
As of why it is so efficient: normally the edge count |E| is only slightly larger than the vertex count |V|. There is a large difference between |E|-|V|+1 and |E|.

For instance, a no-branch function just needs one counter.
)

The loop optimization (instead of adding a counter N times, add N to it) cannot be enabled.
The benefit is relatively small, though.

The frontend cannot apply inlining or some early optimizations to greatly decrease the number of counters.

Now this patch series adds machine basic blocks instrumentation.
I wonder what it can do while the regular IR instrumentation cannot.

Machine basic block instrumentation has some awkward points.
Semantic information is lost. The loop optimization definitely cannot be applied.
If an IR basic block maps to multiple machine basic blocks, you need multiple increments for each MBB while with IR BB you may just need one (e.g. dominator).
Edge profiling is tricky. Edge profiling requires splitting critical edges - it is not clear how you can do this after the machine basic block layout is finalized.

Good summary.

For instance, the important Kirichoff's circult law (aka spanning tree)
optimization is not implemented. (I added the optimization to clang
-fprofile-generate).

You probably meant -fprofile-instr-generate :)

So in bad cases (e.g. libvpx) -fprofile-instr-generate can be 15% slower
than -fprofile-arcs/-fprofile-generate.

The loop optimization (instead of adding a counter N times, add N to it)
cannot be enabled.
The benefit is relatively small, though.

The frontend cannot apply inlining or some early optimizations to greatly
decrease the number of counters.

Instrumenting machine basic blocks feels awkward to me.
Now much semantic information is lost. The loop optimization definitely
cannot be applied.
Edge profiling is tricky. Edge profiling requires splitting critical edges

  • it is not clear how you can do this after the machine basic block layout

is finalized.

David

Repository:

rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION

https://reviews.llvm.org/D104060/new/

https://reviews.llvm.org/D104060

Good summary.

For instance, the important Kirichoff's circult law (aka spanning tree)
optimization is not implemented. (I added the optimization to clang
-fprofile-generate).

You probably meant -fprofile-instr-generate :)

Thanks for noticing the typo:) Corrected. @xur implemented the optimization for -fprofile-generate (D12781). I ported it to -fprofile-arcs.
-fprofile-instr-generate doesn't have the optimization.

ellis added a comment.EditedJun 16 2021, 4:16 PM

I've created a gist comment to help explain some of the unique implementation details and discuss reasons for adding a new framework instead of extending some existing pgi.

https://gist.github.com/ellishg/46f04bc7761e64841653f0bcad64a1aa
(The gist has been moved to a comment so people can easily add comments)

We can also take discussions offline if you have more questions. You can email me (ellis.sparky.hoag@gmail.com) and @kyulee (kyulee@fb.com) to setup a VC.

CC @MaskRay @davidxl

ellis added a comment.Jun 16 2021, 5:09 PM

Section Layout

MIP's major feature is the ability to extract all metadata from the instrumented binary to reduce its size. This is possible by using two sections, __llvm_mipraw to store profile data and __llvm_mipmap to store function metadata. The map section is extracted from the binary before runtime, and the raw section is dumped at runtime. Both sections are then combined to a final .mip profile. Both sections have a 24 byte header and a box of data for each instrumented function.

__llvm_mipraw

The contents of this section depends on the type of instrumentation used, but the data is always "unreadable" without the map section. For function coverage, we simply allocate a single byte for each instrumented function and initialize it to all ones.

_Z3foov$RAW:
  .byte  0xff

__llvm_mipmap

The map section allows us to map instrumented functions to their profile data in the raw section. The two most important values are the function name and the offset to the profile data in the raw section. With these values we can read a function's profile data using the offset.

_Z3foov$MAP:
.Lref:
  .long  __start___llvm_mipraw-.Lref     # Raw Section Start PC Offset
  .long  _Z3foov$RAW-.Lref               # Raw Profile Symbol PC Offset
  .long  _Z3foov-.Lref                   # Function PC Offset
  .long  [[FOO_END]]-_Z3foov             # Function Size
  .long  0x70c9fa27                      # CFG Signature
  .long  2                               # Non-entry Block Count
  .long  [[FOO_BLOCK0]]-_Z3foov          # Block 0 Offset
  .long  [[FOO_BLOCK1]]-_Z3foov          # Block 1 Offset
  .long  7                               # Function Name Length
  .ascii  "_Z3foov"

The main challenge is storing the offset to the profile data without using dynamic relocations. This is complicated by the fact that we use comdat sections within the __llvm_mipraw section and that ELF does not seem to directly support section relative addresses. The solution is to use PC relative relocations. __start___llvm_mipraw-.Lref gives us the PC relative offset to the start of the raw section and _Z3foov$RAW-.Lref gives us the offset to the profile data for this function relative to the same PC. After we extract the map section, we can subtract these to get the value we want, the section relative raw profile data offset.

(_Z3foov$RAW-.Lref) - (__start___llvm_mipraw-.Lref) = _Z3foov$RAW - __start___llvm_mipraw

We can use the same trick to encode the function address, we just need to also add the address of the raw section which can be looked up in the binary. This is useful to lookup debug info and join it with our final profile data.

The other values are relatively straightforward. The function size and block offsets allow us to map block coverage data to debug info. The control flow graph signature is used to verify that the function has not changed since the profile was collected. And last we have the mangled function name to help identify the function.

Integration with Existing PGI?

There seems to be a question about why we chose to implement MIP from scratch instead of extending an existing framework. If we were to extend -fprofile-instr-generate we would need to have extractable metadata, which may be too invasive to implement.

  • As shown above, a lot of work was done to make sure the metadata can be extracted correctly.
  • Existing pgi has structured raw data that would need to be moved to the extractable metadata section.
  • Our MIP tool has a llvm-mipdata create command which converts the map section to an “empty” profile that can later be merged with raw profiles. Existing pgi tools do not have this extra step.

MIP Edge Profile

We omitted the code that adds call edge profiles to reduce the complexity of the current review, but we do plan to upload it. For each instrumented function, we sample return address values so we can build a dynamic call graph from the profile. The format is largely the same, but we added a size-configurable static buffer to hold the return address values. This approach correctly profiles dynamic dispatch calls that are found in Objective-C and Swift as well as any normal function calls. We can, for example, identify candidate for objc_direct using this call edge data.

Optimization

We also omitted the profile consumption and optimization code from the current review. Our focus for optimization is on outlining to reduce size and function ordering to reduce page faults. When we consume profile data with function coverage, block coverage, or function call counts we can make smarter inlining and outlining decisions based on hotness. When we have profile data with timestamps, call counts, or the dynamic call graph then we can generate an optimal function order that reduces the number of page faults.

MaskRay added a comment.EditedJun 16 2021, 5:32 PM

I think people's main question is what distinguishing features make MachineFunction instrumentation appealing.

MIP Edge Profile, Optimization

The two are very inconvenient at the MachineFunction/MachineBasicBlock level...
I don't know how you can make edge profiling work for BB transitions...

I changed clang -fprofile-arcs to use critical edge splitting insert of PHI nodes and noticed some code generation improvement.
I know this is difficult/infeasible if you cannot split machine basic blocks...

For the profile format, there is indeed a bit redundancy in -fprofile-generate/-fprofile-instr-generate.
Some fields are reserved even if value profiling is not used. I do not have a good idea how we can save the space for coverage usage.
Some fields are 64-bit for generality. As I mentioned, a 32-bit CFG signature makes it less robust when the number of functions exceed roughly 2**16.
The 32-bit Function PC Offset is probably sufficient for most usage but will not work with medium/large code model programs.

I have been slowly trying to making -fprofile-generate/-fprofile-instr-generate object files/linkaged images smaller (e.g. D103372) since last year (I have much to learn..).
(I can test on Linux and Windows, so I'll try making both work. I don't have Mach-O but I am happy to report whatever issues I have found, though.)
I do plan to try PC-relative relocations (I made such improvement for XRay: D78082/D78590/D87977; the only portability issue is that we will require the integrated assembler for mips64)
and probably make the symbol in __llvm_prf_data local alias to avoid an R_*_RELATIVE dynamic relocation.
(I need to study more about llvm-profdata.)

ellis added a comment.Jun 16 2021, 6:04 PM

__llvm_mip_call_counts_caller is slow.
It is a function with a custom call convention using RAX as the argument on x86-64.
The impl detail function saves and restores many vector registers.
I haven't studied why __llvm_mip_call_counts_caller is needed.

Yes, __llvm_mip_call_counts_caller is not optimal, but we wanted to first have correctness. Since we are injecting calls to the runtime at the very beginning of functions, we save/restore the stack frame in __llvm_mip_call_counts_caller. In our return address instrumentation code, we also use this helper function to pass the return address register to the runtime.

__llvm_mipmap has these fields. I added an inline comment that -shared doesn't work.

Unfortunately, yes, it seems -shared does not work, but I don't know enough about it to have ideas for fixes at the moment.

        .section        __llvm_mipmap,"aw",@progbits
        .globl  _Z3fooPiS_$MAP
        .p2align        3
_Z3fooPiS_$MAP:
.Lref2:
  ### not sure why this is needed
        .long   __start___llvm_mipraw-.Lref2    # Raw Section Start PC Offset

  ##### this does not link in -fpic -shared mode
        .long   _Z3fooPiS_$RAW-.Lref2           # Raw Profile Symbol PC Offset

        .long   _Z3fooPiS_-.Lref2               # Function PC Offset
        .long   .Lmip_func_end0-_Z3fooPiS_      # Function Size
        .long   0x0                             # CFG Signature
        .long   0                               # Non-entry Block Count
        .long   10                              # Function Name Length
        .ascii  "_Z3fooPiS_"

In the previous comment I describe these fields in detail.

Now this patch series adds machine basic blocks instrumentation.
I wonder what it can do while the regular IR instrumentation cannot.

Machine basic block instrumentation has some awkward points.
Semantic information is lost. The loop optimization definitely cannot be applied.
If an IR basic block maps to multiple machine basic blocks, you need multiple increments for each MBB while with IR BB you may just need one (e.g. dominator).
Edge profiling is tricky. Edge profiling requires splitting critical edges - it is not clear how you can do this after the machine basic block layout is finalized.

The benefit of instrumenting machine basic blocks is we can easily mark MBBs that were not executed as candidates for outlining. We can definitely apply Kirchoff's cirtuit law optimization to reduce the number of stores.

ellis added a comment.Jun 16 2021, 6:08 PM

I have been slowly trying to making -fprofile-generate/-fprofile-instr-generate object files/linkaged images smaller (e.g. D103372) since last year (I have much to learn..).
(I can test on Linux and Windows, so I'll try making both work. I don't have Mach-O but I am happy to report whatever issues I have found, though.)
I do plan to try PC-relative relocations (I made such improvement for XRay: D78082/D78590/D87977; the only portability issue is that we will require the integrated assembler for mips64)
and probably make the symbol in __llvm_prf_data local alias to avoid an R_*_RELATIVE dynamic relocation.
(I need to study more about llvm-profdata.)

I'm really happy to see this work! I also have much to learn so I'll try to keep an eye out for related diffs in the future.

ellis added a comment.Jun 16 2021, 6:17 PM

For the profile format, there is indeed a bit redundancy in -fprofile-generate/-fprofile-instr-generate.
Some fields are reserved even if value profiling is not used. I do not have a good idea how we can save the space for coverage usage.
Some fields are 64-bit for generality. As I mentioned, a 32-bit CFG signature makes it less robust when the number of functions exceed roughly 2**16.

Actually, the 32-bit CFG signature we use in MIP is not unique to functions. If two functions have the same basic block layout, they will have the same CFG signature and that is not a problem. The field is used to determine if a function has changed from when it was profiled. I'm not sure if this is different from the other profile formats.

The 32-bit Function PC Offset is probably sufficient for most usage but will not work with medium/large code model programs.

That is true, but we wanted to use 32 bit values to maintain consistency with armv7 targets. We could probably add a flag to support 64 bit values if we need to.

ellis updated this revision to Diff 352597.Jun 16 2021, 6:37 PM

The __llvm_mipmap section should not have the SHF_ALLOC bit set.

I think people's main question is what distinguishing features make MachineFunction instrumentation appealing.

MIP Edge Profile, Optimization

The two are very inconvenient at the MachineFunction/MachineBasicBlock level...
I don't know how you can make edge profiling work for BB transitions...

MIP does not (cannot) collect BB edge data but MachineBlock coverage as needed (optional).
But, MIP can collect call-edge data for all call-sites including dynamic dispatch calls that are not covered by LLVM IR instrumentation.
As commented earlier, MIP is initially designed for mobile applications where majority calls are dynamic. In this world, size or minimum size optimizations are typically enabled.
So, traditional speed optimizations like inlining or vectorization from BB edge profiles were not a great concern.
Instead, the MIP data were mainly used for ordering, separation, or outlining to minimize CPU penalties while saving as much size as possible.

We understand if we want an IR level profile (e.g. BB edge profiles) for IR level optimizations, it would be tricky because MIP instrument Machine IRs.
However, internally we've already experimented a SamplePGO like conversion to generate LLVM IR profile converted from MIP using symbolication.
Certainly this will lose precision, but it's generally good enough in this domain because the majority of speed optimizations will be blocked anyhow under minimum size-opt.

Nonetheless, I think it's also worth revisiting MIP-like implementation at IR level to support full IR profiles including BB edge profile by reusing as much LLVM IR instrumentation code as possible.
I do still think refactoring the profile format of the existing LLVM IR instrumentation for the integration of MIP seems too disruptive while breaking the existing infra or usage.
Instead, I'd like to keep the mipmap (metadata) layout in MC to make them extractable, and this means we may retain pseudo ops all the way down to MC from IR.

ellis updated this revision to Diff 352871.Jun 17 2021, 4:33 PM

Raw symbols should have hidden visibility so that -fPIC -shared works

Some email conversations are not on Phabricator. I record a copy here so that people who are not subscribed can have a full view

davidxl: I believe you mean -fprofile-generate or -fcs-profile-generate. -fprofile-instr-generate is based on front end AST and eventually will be hidden under -fcoverage-test for source coveraging purpose only.
davidxl: As you can see we are currently making an effort to unify/simplify the PGO implementation. Having yet another instrumentation mechanism is doing the opposite.

As shown above, a lot of work was done to make sure the metadata can be extracted correctly.
Existing pgi has structured raw data that would need to be moved to the extractable metadata section.
Our MIP tool has a llvm-mipdata create command which converts the map section to an “empty” profile that can later be merged with raw profiles. Existing pgi tools do not have this extra step.

davidxl: As I said, size improvement efforts (under options) are welcome for the existing IRPGO. Another benefit is that we can have consolidated effort on improving related toolings.

MIP Edge Profile

davidxl: Adding this duplicate functionality (edge profiling) makes it even less compelling to do it MIR level.
davidxl: For the dynamic dispatching profiling, does it handle any number of indirect targets or only supports topN?

Optimization

davidxl: See above, adding any missing features in the existing framework is a more prefered approach. I am yet to see convincing arguments that spinning off a new instrumentation framework is the way to go.


maskray: I think people's main question is what distinguishing features make MachineFunction instrumentation appealing.

maskray: > MIP Edge Profile, Optimization

maskray: The two are very inconvenient at the MachineFunction/MachineBasicBlock level...
maskray: I don't know how you can make edge profiling work for BB transitions...

kyulee: MIP does not (cannot) collect BB edge data but MachineBlock coverage as needed (optional).
kyulee: But, MIP can collect call-edge data for all call-sites including dynamic dispatch calls that are not covered by LLVM IR instrumentation.

davidxl: IR instrumentation supports indirect call target profiling. I suppose MIP has a lightweight mechanism at the cost of tracking precision? Anyway, I don't think this is not something IR instrumentation can not have.

kyulee: As commented earlier, MIP is initially designed for mobile applications where majority calls are dynamic. In this world, size or minimum size optimizations are typically enabled.
kyulee: So, traditional speed optimizations like inlining or vectorization from BB edge profiles were not a great concern.
kyulee: Instead, the MIP data were mainly used for ordering, separation, or outlining to minimize CPU penalties while saving as much size as possible.

davidxl: Edge profiling helps size optimization as well -- we recently added OptimizeForSize support at BB level so that cold blocks can be better size optimized.
davidxl:
davidxl: Another plug -- if you are interested in size optimization, the ML based size optimization is also available in LLVM -- it beats -Oz.

kyulee:We understand if we want an IR level profile (e.g. BB edge profiles) for IR level optimizations, it would be tricky because MIP instrument Machine IRs.

davidxl: -fcs-profile-generate is very late in the IR pipeline after inlining transformations, so there is very little information loss when passing to MIR.

kyulee: However, internally we've already experimented a SamplePGO like conversion to generate LLVM IR profile converted from MIP using symbolication.
kyulee: Certainly this will lose precision, but it's generally good enough in this domain because the majority of speed optimizations will be blocked anyhow under minimum size-opt.

davidxl: Do you mean 'converting MIP' for IR passes?

kyulee: Nonetheless, I think it's also worth revisiting MIP-like implementation at IR level to support full IR profiles including BB edge profile by reusing as much LLVM IR instrumentation code as possible.

davidxl: yes.

kyulee: I do still think refactoring the profile format of the existing LLVM IR instrumentation for the integration of MIP seems too disruptive while breaking the existing infra or usage.

davidxl: It is fine to do this under an option and produce profile data with different magic number or flavor bit -- this is well supported.


kyulee: I understand IR instrumentation has value-profile for indirect call targets at call-site. I don’t think IR instrumentation covers dynamic dispatch call-site like msgSend whose target resolution happens at runtime. Instead of instrumenting each call-site, MIP tracks return address values at the function entry via a sort of back-tracking to reconstruct call-edges regardless of all type of calls – direct, indirect, or dynamic.
kyulee: MIP does this instrumentation at post-RA where assembly level coding is relatively straightforward. I think doing this at IR before frame-lowering will need extra overhead/mechanic to ensure this instrumentation happens in the very beginning of function.

davidxl: Do you mean 'converting MIP' for IR passes?

kyulee: Yes. MIP is not IR-attached, but rather tagged on machine address with which we can easily correlate debug data.
kyulee: So, it’s possible to construct a SamplePGO profile that is consumable for IR passes.

davidxl: Basically the callsite context (counter) is passed to the caller so it can do the profiling. GCC does that too.
davidxl: For IRPGO, we plan to add dynamic type profiling at some point. Once that is ready, the problem of message passing style call profiling will be handled.
davidxl: Also if edge profiling is available, profiling direct calls will be a waste as the information can be deduced.

MaskRay added a comment.EditedJun 18 2021, 10:06 AM

@ellis

The main challenge is storing the offset to the profile data without using dynamic relocations. This is complicated by the fact that we use comdat sections within the llvm_mipraw section and that ELF does not seem to directly support section relative addresses. The solution is to use PC relative relocations. start___llvm_mipraw-.Lref gives us the PC relative offset to the start of the raw section and _Z3foov$RAW-.Lref gives us the offset to the profile data for this function relative to the same PC. After we extract the map section, we can subtract these to get the value we want, the section relative raw profile data offset.

(_Z3foov$RAW-.Lref) - (start_llvm_mipraw-.Lref) = _Z3foov$RAW - start_llvm_mipraw

I am unclear about this. Why does the $MAP section needs to know its relative position?

Note: if the $RAW symbol has a local linkage or has the hidden visibility, a label difference can indeed avoid a dynamic relative relocation. I have a patch D104556

# ELF: R_X86_64_PC64
.quad .L__profc_foo-.L__profd_foo

# Mach-O: a pair of X86_64_RELOC_UNSIGNED and X86_64_RELOC_SUBTRACTOR
.quad l___profc_foo-l___profd_foo

Unfortunately this is currently not representable in COFF:

% clang -fprofile-generate a.c -c -target x86_64-windows -o d.o
error: Cannot represent this expression

We can use the same trick to encode the function address, we just need to also add the address of the raw section which can be looked up in the binary. This is useful to lookup debug info and join it with our final profile data.

A __profd_$name variable has a similar field referencing the function. It is used by IPVK_IndirectCallTarget so that indirect call target profile can be translated to function names.

We can save the dynamic relocation with the following scheme:

; Note the sub constexpr
@__profd_main = private global { i64, i64, i64*, i64, i8*, i32, [2 x i16] } { i64 -2624081020897602054, i64 742261418966908927, i64* getelementptr inbounds ([1 x i64], [1 x i64]* @__profc_main, i32 0, i32 0), i64 sub (i64 ptrtoint (i32 ()* @alias_main to i64), i64 ptrtoint ({ i64, i64, i64*, i64, i8*, i32, [2 x i16] }* @__profd_main to i64)), i8* null, i32 1, [2 x i16] zeroinitializer }, section "__llvm_prf_data", comdat($__profc_main), align 8

@alias_main = private alias i32 (), i32 ()* @main

define i32 @main() #0 {
  ...

For other fields of __profd_$name, other than the generality and mandatory value profiling (even if you don't use it) issues I mentioned in https://reviews.llvm.org/D104060#2818268 ,
I think the $MAP format is the same.

The $MAP format does not implement these things:

  • using private linkage as much as possible
  • ELF comdat any/noduplicates (this increases object file sizes but can enable --gc-sections for linked images)
  • function name compression
ellis marked an inline comment as done.Jun 18 2021, 10:50 AM

@davidxl

davidxl: For the dynamic dispatching profiling, does it handle any number of indirect targets or only supports topN?

Our edge profiling patch (not included) was also designed with code size and runtime in mind. We allocate a buffer and sample return addresses, possibly overwriting old values when the buffer is full. Data for common callsites will overwrite data for rare callsites, so I guess it's more like top N, but the size of the buffer can be as large as needed.

davidxl: Basically the callsite context (counter) is passed to the caller so it can do the profiling. GCC does that too.

Our method does not need to touch the callsite code to work. In fact, most of the implementation is done in D104089 so we mostly only change the runtime code.

davidxl: For IRPGO, we plan to add dynamic type profiling at some point. Once that is ready, the problem of message passing style call profiling will be handled.
davidxl: Also if edge profiling is available, profiling direct calls will be a waste as the information can be deduced.

@davidxl

davidxl: For the dynamic dispatching profiling, does it handle any number of indirect targets or only supports topN?

Our edge profiling patch (not included) was also designed with code size and runtime in mind. We allocate a buffer and sample return addresses, possibly overwriting old values when the buffer is full. Data for common callsites will overwrite data for rare callsites, so I guess it's more like top N, but the size of the buffer can be as large as needed.

Neat -- so the value profile (target histogram) of a given callsite is 'distributed' into the callees -- aka each callee function allocating a fixed size buffer to track all incoming edge frequencies? For some small utility functions, they usually have thousands of callsites (incoming edges), thus this approach may significantly reduce the profile precision for them? The simple LRU eviction policy may be bad for some patterns like ping-pong effect.

davidxl: Basically the callsite context (counter) is passed to the caller so it can do the profiling. GCC does that too.

Our method does not need to touch the callsite code to work. In fact, most of the implementation is done in D104089 so we mostly only change the runtime code.

Will take a look.

davidxl: For IRPGO, we plan to add dynamic type profiling at some point. Once that is ready, the problem of message passing style call profiling will be handled.
davidxl: Also if edge profiling is available, profiling direct calls will be a waste as the information can be deduced.

@ellis

The main challenge is storing the offset to the profile data without using dynamic relocations. This is complicated by the fact that we use comdat sections within the llvm_mipraw section and that ELF does not seem to directly support section relative addresses. The solution is to use PC relative relocations. start___llvm_mipraw-.Lref gives us the PC relative offset to the start of the raw section and _Z3foov$RAW-.Lref gives us the offset to the profile data for this function relative to the same PC. After we extract the map section, we can subtract these to get the value we want, the section relative raw profile data offset.

(_Z3foov$RAW-.Lref) - (start_llvm_mipraw-.Lref) = _Z3foov$RAW - start_llvm_mipraw

I am unclear about this. Why does the $MAP section needs to know its relative position?

The map section doesn't care about its relative position, but it is used to compute the value we want. Ideally we would do something simple like this to get the section relative address of the symbol.

_Z3foov$RAW-__start___llvm_mipraw

From my testing this doesn't work because these symbols are in different section in ELF (because we use comdat sections for the header). Also, IIRC there were other issues due to relocations getting resolved before the final executable was created and ending up with the wrong values. The solution in this patch was the only one that seems to work in all cases.

Note: if the $RAW symbol has a local linkage or has the hidden visibility, a label difference can indeed avoid a dynamic relative relocation.

# ELF: R_X86_64_PC64
.quad .L__profc_foo-.L__profd_foo

# Mach-O: a pair of X86_64_RELOC_UNSIGNED and X86_64_RELOC_SUBTRACTOR
.quad l___profc_foo-l___profd_foo

Unfortunately this may not be representable in COFF:

% clang -fprofile-generate a.c -c -target x86_64-windows -o d.o
error: Cannot represent this expression

I haven't tested COFF, but i think it might support section relative addresses which would make this format much simpler.

We can use the same trick to encode the function address, we just need to also add the address of the raw section which can be looked up in the binary. This is useful to lookup debug info and join it with our final profile data.

A __profd_$name variable has a similar field referencing the function. It is used by IPVK_IndirectCallTarget so that indirect call target profile can be translated to function names.

We can save the dynamic relocation with the following scheme:

; Note the sub constexpr
@__profd_main = private global { i64, i64, i64*, i64, i8*, i32, [2 x i16] } { i64 -2624081020897602054, i64 742261418966908927, i64* getelementptr inbounds ([1 x i64], [1 x i64]* @__profc_main, i32 0, i32 0), i64 sub (i64 ptrtoint (i32 ()* @alias_main to i64), i64 ptrtoint ({ i64, i64, i64*, i64, i8*, i32, [2 x i16] }* @__profd_main to i64)), i8* null, i32 1, [2 x i16] zeroinitializer }, section "__llvm_prf_data", comdat($__profc_main), align 8

@alias_main = private alias i32 (), i32 ()* @main

define i32 @main() #0 {
  ...

For other fields of __profd_$name, other than the generality and mandatory value profiling (even if you don't use it) issues I mentioned in https://reviews.llvm.org/D104060#2818268 ,
I think the $MAP format is the same.

The format is probably the same. I'm interested to see if we can do something similar to replace our encoded function address.

The $MAP format does not implement these things:

  • using private linkage as much as possible

I think we can use private linkage in more places if we need to.

  • ELF comdat any/noduplicates (this increases object file sizes but can enable --gc-sections for linked images)
  • function name compression

We don't care about the size of the map section since it is extracted out.

I probably sound like a broken record, but we've spent a lot of time making sure the map section can be extracted correctly and that the raw section has no excess info. This is a major feature of MIP, the profile data and the function info are separated so that only the necessary data remains in the binary. The question is whether we should extend an existing pgi to support this feature or if MIP deserves to be its own framework. I do see the value in extending one of the many existing pgi to reduce duplicate work. My thoughts are that it would be too invasive to do everything we need; add one or two new sections, create a new .profraw and .profmap format, add a few flags, and extend the tools. By keeping MIP separate, we can make design decisions that align with our code size goal that may not be so easy to do in existing frameworks.

Neat -- so the value profile (target histogram) of a given callsite is 'distributed' into the callees -- aka each callee function allocating a fixed size buffer to track all incoming edge frequencies? For some small utility functions, they usually have thousands of callsites (incoming edges), thus this approach may significantly reduce the profile precision for them? The simple LRU eviction policy may be bad for some patterns like ping-pong effect.

Our current approach is to use a single global buffer for all callees and we store a value that identifies both a callsite address and the callee. We've considered other options like a buffer for each callee, but we settled with the global buffer approach to avoid locks, extra complexity, and runtime. Yeah there are cases where we oversample some edges. I'd say this patch is still the in the experimental stage so there is still work to be done.

Neat -- so the value profile (target histogram) of a given callsite is 'distributed' into the callees -- aka each callee function allocating a fixed size buffer to track all incoming edge frequencies? For some small utility functions, they usually have thousands of callsites (incoming edges), thus this approach may significantly reduce the profile precision for them? The simple LRU eviction policy may be bad for some patterns like ping-pong effect.

Our current approach is to use a single global buffer for all callees and we store a value that identifies both a callsite address and the callee. We've considered other options like a buffer for each callee, but we settled with the global buffer approach to avoid locks, extra complexity, and runtime. Yeah there are cases where we oversample some edges. I'd say this patch is still the in the experimental stage so there is still work to be done.

Single global buffer introduces more data races which can be worse with/without using lock compared with split buffers.

I probably sound like a broken record, but we've spent a lot of time making sure the map section can be extracted correctly and that the raw section has no excess info. This is a major feature of MIP, the profile data and the function info are separated so that only the necessary data remains in the binary. The question is whether we should extend an existing pgi to support this feature or if MIP deserves to be its own framework. I do see the value in extending one of the many existing pgi to reduce duplicate work. My thoughts are that it would be too invasive to do everything we need; add one or two new sections, create a new .profraw and .profmap format, add a few flags, and extend the tools. By keeping MIP separate, we can make design decisions that align with our code size goal that may not be so easy to do in existing frameworks.

The existing __llvm_prf_cnts/__llvm_prf_data can be exacted easily as well... What functionality do you find missing?

It requires a lot of efforts to make linker garbage collection work. How does MIP work better? I don't find comdat specific code or Mach-O S_ATTR_LIVE_SUPPORT.

I have a patch to make use label differences (PC-relative relocations on ELF) for CountersPtr: D104556. The tricky part is to make COFF work because COFF has some limitation.

Do you have measurement of metadata section size, dynamic relocation size, symbol table entry size for the __llvm_prf_* format and MIP?

ellis updated this revision to Diff 353485.Jun 21 2021, 2:14 PM

__llvm_mipmap section should have type ELF::SHF_NOTE so that it doesn't get stripped by --gc-sections.

__llvm_mipmap section should have type ELF::SHF_NOTE so that it doesn't get stripped by --gc-sections.

This is an abuse of the section type SHF_NOTE. GNU ld retaining allocable SHF_NOTE sections under --gc-sections is a somewhat unfortunate fact. I feel it is inappropriate and I know a FreeBSD folk who doesn't like this rule.

A metadata section design making GC work under all of the 3 major binary formats (ELF/PE-COFF/Mach-O) is very complex. You may dig some history related to __llvm_prf_* and __prof*. That may be another argument that we should probably focus on one format. After D104556 and a follow-up patch making the function address relative, I think __llvm_prf_* will be in a very good state.

__llvm_mipmap section should have type ELF::SHF_NOTE so that it doesn't get stripped by --gc-sections.

This is an abuse of the section type SHF_NOTE. GNU ld retaining allocable SHF_NOTE sections under --gc-sections is a somewhat unfortunate fact. I feel it is inappropriate and I know a FreeBSD folk who doesn't like this rule.

No? We don't like the fact that it _doesn't_ get retained in certain cases, because that's where our ABI branding lives (as does glibc's, but they don't care all that much about it, as they only brand executables, not libraries, so it's of limited use...). We need it to be present, always, and cannot yet rely on SHT_RETAIN as a feature introduced in the past year is far too new to be able to rely on.

ellis added a comment.Jun 21 2021, 8:49 PM

__llvm_mipmap section should have type ELF::SHF_NOTE so that it doesn't get stripped by --gc-sections.

This is an abuse of the section type SHF_NOTE. GNU ld retaining allocable SHF_NOTE sections under --gc-sections is a somewhat unfortunate fact. I feel it is inappropriate and I know a FreeBSD folk who doesn't like this rule.

I found that SHF_GNU_RETAIN was not enough, but I'm happy to explore other ways if SHF_NOTE is not preferred.

A metadata section design making GC work under all of the 3 major binary formats (ELF/PE-COFF/Mach-O) is very complex. You may dig some history related to __llvm_prf_* and __prof*. That may be another argument that we should probably focus on one format. After D104556 and a follow-up patch making the function address relative, I think __llvm_prf_* will be in a very good state.

Thanks, I'll take a look at __llvm_prf_* and friends.

int3 added a subscriber: int3.Jun 21 2021, 11:13 PM

I probably sound like a broken record, but we've spent a lot of time making sure the map section can be extracted correctly and that the raw section has no excess info. This is a major feature of MIP, the profile data and the function info are separated so that only the necessary data remains in the binary. The question is whether we should extend an existing pgi to support this feature or if MIP deserves to be its own framework. I do see the value in extending one of the many existing pgi to reduce duplicate work. My thoughts are that it would be too invasive to do everything we need; add one or two new sections, create a new .profraw and .profmap format, add a few flags, and extend the tools. By keeping MIP separate, we can make design decisions that align with our code size goal that may not be so easy to do in existing frameworks.

The existing __llvm_prf_cnts/__llvm_prf_data can be exacted easily as well... What functionality do you find missing?

Thanks for the patch D104556 that uses PC relative relocations for reference! This avoids dynamic relocation, which is the prerequisite to make a section (region) be extractable.
Unfortunately this is not sufficient because we need to find the references once they are extracted out because the address ranges (either in the original binary and the extracted binary) are not valid or can be altered.
So, we need some notion of section (or region) relative addressing which is not all available in COFF/ELF/MachO. One trick in MIP is to introduce an anchor reference to express such addressing with two PC relative relocations.
In fact, this anchor reference can be hoisted out into the header which we can share, which is not in this change yet.
To interoperate and make all different __llvm_prof_* be extractable, I think we should introduce a header (weak or comdat) like MIP for each __llvm_prof_* and bookkeep distance/relocations within it across regions.
I can see it's technically doable but the cost and risk seems high since we need to restructure __llvm_prof_* and its dependencies. If we're meant to just introduce another section like __llvm_mipmap under a flag in LLVM IR infra, this is just confusing.

I probably sound like a broken record, but we've spent a lot of time making sure the map section can be extracted correctly and that the raw section has no excess info. This is a major feature of MIP, the profile data and the function info are separated so that only the necessary data remains in the binary. The question is whether we should extend an existing pgi to support this feature or if MIP deserves to be its own framework. I do see the value in extending one of the many existing pgi to reduce duplicate work. My thoughts are that it would be too invasive to do everything we need; add one or two new sections, create a new .profraw and .profmap format, add a few flags, and extend the tools. By keeping MIP separate, we can make design decisions that align with our code size goal that may not be so easy to do in existing frameworks.

The existing __llvm_prf_cnts/__llvm_prf_data can be exacted easily as well... What functionality do you find missing?

Thanks for the patch D104556 that uses PC relative relocations for reference! This avoids dynamic relocation, which is the prerequisite to make a section (region) be extractable.
Unfortunately this is not sufficient because we need to find the references once they are extracted out because the address ranges (either in the original binary and the extracted binary) are not valid or can be altered.
So, we need some notion of section (or region) relative addressing which is not all available in COFF/ELF/MachO. One trick in MIP is to introduce an anchor reference to express such addressing with two PC relative relocations.
In fact, this anchor reference can be hoisted out into the header which we can share, which is not in this change yet.

This is interesting. Can you describe a little more with the fuzzy matching, perhaps with a small example to demonstrate how it works?

David

To interoperate and make all different __llvm_prof_* be extractable, I think we should introduce a header (weak or comdat) like MIP for each __llvm_prof_* and bookkeep distance/relocations within it across regions.
I can see it's technically doable but the cost and risk seems high since we need to restructure __llvm_prof_* and its dependencies. If we're meant to just introduce another section like __llvm_mipmap under a flag in LLVM IR infra, this is just confusing.

ellis added a comment.Jun 23 2021, 3:00 PM

Thanks for the patch D104556 that uses PC relative relocations for reference! This avoids dynamic relocation, which is the prerequisite to make a section (region) be extractable.
Unfortunately this is not sufficient because we need to find the references once they are extracted out because the address ranges (either in the original binary and the extracted binary) are not valid or can be altered.
So, we need some notion of section (or region) relative addressing which is not all available in COFF/ELF/MachO. One trick in MIP is to introduce an anchor reference to express such addressing with two PC relative relocations.
In fact, this anchor reference can be hoisted out into the header which we can share, which is not in this change yet.

This is interesting. Can you describe a little more with the fuzzy matching, perhaps with a small example to demonstrate how it works?

If you're asking about our trick for the raw profile addresses, I have a comment above that describes it in detail (the comment title is "Section Layout"). If your asking about something else, could you please elaborate?

ellis updated this revision to Diff 354086.Jun 23 2021, 3:15 PM

Use the ELF::SHF_GNU_RETAIN flag for the __llvm_mipmap section so that it doesn't get stripped by --gc-sections. The ELF::SHF_NOTE flag is not necessary.

Thanks for the patch D104556 that uses PC relative relocations for reference! This avoids dynamic relocation, which is the prerequisite to make a section (region) be extractable.
Unfortunately this is not sufficient because we need to find the references once they are extracted out because the address ranges (either in the original binary and the extracted binary) are not valid or can be altered.
So, we need some notion of section (or region) relative addressing which is not all available in COFF/ELF/MachO. One trick in MIP is to introduce an anchor reference to express such addressing with two PC relative relocations.
In fact, this anchor reference can be hoisted out into the header which we can share, which is not in this change yet.

This is interesting. Can you describe a little more with the fuzzy matching, perhaps with a small example to demonstrate how it works?

If you're asking about our trick for the raw profile addresses, I have a comment above that describes it in detail (the comment title is "Section Layout"). If your asking about something else, could you please elaborate?

I was referring to this comment by kyulee@: "because the address ranges (either in the original binary and the extracted binary) are not valid or can be altered." -- Is this about using stale profile or something else? I might have misunderstood the part about 'address ranges can be altered' part.

Thanks for the patch D104556 that uses PC relative relocations for reference! This avoids dynamic relocation, which is the prerequisite to make a section (region) be extractable.
Unfortunately this is not sufficient because we need to find the references once they are extracted out because the address ranges (either in the original binary and the extracted binary) are not valid or can be altered.
So, we need some notion of section (or region) relative addressing which is not all available in COFF/ELF/MachO. One trick in MIP is to introduce an anchor reference to express such addressing with two PC relative relocations.
In fact, this anchor reference can be hoisted out into the header which we can share, which is not in this change yet.

This is interesting. Can you describe a little more with the fuzzy matching, perhaps with a small example to demonstrate how it works?

If you're asking about our trick for the raw profile addresses, I have a comment above that describes it in detail (the comment title is "Section Layout"). If your asking about something else, could you please elaborate?

I was referring to this comment by kyulee@: "because the address ranges (either in the original binary and the extracted binary) are not valid or can be altered." -- Is this about using stale profile or something else? I might have misunderstood the part about 'address ranges can be altered' part.

Say a binary consists of two sections, C and D.
When D is extracted out (and thus C only remains), the extracted binary D starts with 0 while the address of D in the prior (or original) binary used to be after C. So, a direct addressing is simply invalid unless we bookeep the original location somehow.
In addition, we also saw the case where even C's address in the resulting (remaining) binary might be shifted via a strip process that seemed to alter the image base.
So, I think we need to express and operate with the references which are constant within each section or region.

This is a longer version, and I hope it helps for clarification on how and why MIP works in this context.
Let say we have two sections C (raw counter) and D (metadata) that has references to C where D is to be extracted out (at build-time).

CHeader:
C1:
C2:

Ideally, we want to layout D as below so that the contents in D are constant, but this is not possible in all platforms.

DHeader:
D1:
  &C1 - &CHeader // Section/region-relative to C1
D2:
  &C2 - &CHeader // Section/region-relative to C2

This is how MIP lays D in this patch.
This simulates section-relative expression with two PC-relative expressions. This incurs extra size for D but in practice, this is not a concern because D is meant to be serialized (or extracted) at build-time.
That being said, MIP does not optimize D for size but for simplicity in the maintenance. All metadata including name, etc. is wrapped and self-contained in each D's entry (function granularity) without spanning multiple metadata sections.
Note D is a sort of raw form of MIP's metadata, which is turned into an optimized (merged) profile form via llvm-mipdata create process once at build-time, which then we can merge with multiple Cs (raw counters) only via llvm-mipdata merge.

DHeader:
D1: 
  &CHeader - &D1(.)  // PC-relative to CHeader/Section
  &C1  - &D1(.) // PC-relative to C1
   // Subtract these two values to get the section-relative constant value -- &C1 - &CHeader =  (&C1  - &D1) - (&CHeader - &D1)
D2:
  &CHeader - &D2(.) // PC-relative to CHeader/Section
  &C2 - &D2(.) // PC-relative to C2

This might be an optimized version by sharing the reference to CHeader in DHeader, which I meant in the prior comment.

DHeader :
  &CHeader - &DHeader (.) // PC-relative to CHeader
D1: 
  &C1 - &D1(.) // PC-relative to C1
   // We know this offset, &D1 - &DHeader from the layout in D. 
   // With the common value in DHeader, compute this base value &CHeader - &D1 = (&CHeader - &DHeader) - (&D1 - &DHeader)
   // Then derive the section-relative value in the same way --  &C1 - &CHeader =  (&C1  - &D1) - (&CHeader - &D1)
D2:
  &C2 - &D2(.) // PC-relative to C2

I think a similar approach can be applied when multiple metadata sections exist. The key idea is to keep the distances among sections/regions (that will be in different address spaces) in the header, and use this common value + the layout offset to compute the section/region-relative constant value.
This last approach would largely maintain the size of metadata sections like D and thus allows them to be optionally extracted out. However, in the presence of multiple metadata sections, the complexity in dependencies (runtime and tools) seems high and also maintaining backward compatibility seems challenging.

Thanks for the patch D104556 that uses PC relative relocations for reference! This avoids dynamic relocation, which is the prerequisite to make a section (region) be extractable.
Unfortunately this is not sufficient because we need to find the references once they are extracted out because the address ranges (either in the original binary and the extracted binary) are not valid or can be altered.
So, we need some notion of section (or region) relative addressing which is not all available in COFF/ELF/MachO. One trick in MIP is to introduce an anchor reference to express such addressing with two PC relative relocations.
In fact, this anchor reference can be hoisted out into the header which we can share, which is not in this change yet.

This is interesting. Can you describe a little more with the fuzzy matching, perhaps with a small example to demonstrate how it works?

If you're asking about our trick for the raw profile addresses, I have a comment above that describes it in detail (the comment title is "Section Layout"). If your asking about something else, could you please elaborate?

I was referring to this comment by kyulee@: "because the address ranges (either in the original binary and the extracted binary) are not valid or can be altered." -- Is this about using stale profile or something else? I might have misunderstood the part about 'address ranges can be altered' part.

Say a binary consists of two sections, C and D.
When D is extracted out (and thus C only remains), the extracted binary D starts with 0 while the address of D in the prior (or original) binary used to be after C. So, a direct addressing is simply invalid unless we bookeep the original location somehow.
In addition, we also saw the case where even C's address in the resulting (remaining) binary might be shifted via a strip process that seemed to alter the image base.
So, I think we need to express and operate with the references which are constant within each section or region.

This is a longer version, and I hope it helps for clarification on how and why MIP works in this context.
Let say we have two sections C (raw counter) and D (metadata) that has references to C where D is to be extracted out (at build-time).

CHeader:
C1:
C2:

Ideally, we want to layout D as below so that the contents in D are constant, but this is not possible in all platforms.

DHeader:
D1:
  &C1 - &CHeader // Section/region-relative to C1
D2:
  &C2 - &CHeader // Section/region-relative to C2

This is how MIP lays D in this patch.
This simulates section-relative expression with two PC-relative expressions. This incurs extra size for D but in practice, this is not a concern because D is meant to be serialized (or extracted) at build-time.
That being said, MIP does not optimize D for size but for simplicity in the maintenance. All metadata including name, etc. is wrapped and self-contained in each D's entry (function granularity) without spanning multiple metadata sections.
Note D is a sort of raw form of MIP's metadata, which is turned into an optimized (merged) profile form via llvm-mipdata create process once at build-time, which then we can merge with multiple Cs (raw counters) only via llvm-mipdata merge.

DHeader:
D1: 
  &CHeader - &D1(.)  // PC-relative to CHeader/Section
  &C1  - &D1(.) // PC-relative to C1
   // Subtract these two values to get the section-relative constant value -- &C1 - &CHeader =  (&C1  - &D1) - (&CHeader - &D1)
D2:
  &CHeader - &D2(.) // PC-relative to CHeader/Section
  &C2 - &D2(.) // PC-relative to C2

This might be an optimized version by sharing the reference to CHeader in DHeader, which I meant in the prior comment.

DHeader :
  &CHeader - &DHeader (.) // PC-relative to CHeader
D1: 
  &C1 - &D1(.) // PC-relative to C1
   // We know this offset, &D1 - &DHeader from the layout in D. 
   // With the common value in DHeader, compute this base value &CHeader - &D1 = (&CHeader - &DHeader) - (&D1 - &DHeader)
   // Then derive the section-relative value in the same way --  &C1 - &CHeader =  (&C1  - &D1) - (&CHeader - &D1)
D2:
  &C2 - &D2(.) // PC-relative to C2

I think a similar approach can be applied when multiple metadata sections exist. The key idea is to keep the distances among sections/regions (that will be in different address spaces) in the header, and use this common value + the layout offset to compute the section/region-relative constant value.
This last approach would largely maintain the size of metadata sections like D and thus allows them to be optionally extracted out. However, in the presence of multiple metadata sections, the complexity in dependencies (runtime and tools) seems high and also maintaining backward compatibility seems challenging.

Thanks for the explanation about it. To summarize it is a technique to further reduce mipdata size by sharing the common header difference.

Unfortunately this is not sufficient because we need to find the references once they are extracted out because the address ranges (either in the original binary and the extracted binary) are not valid or can be altered.

I think extraction works but maybe your extraction operation is different?

The __llvm_prf_{data,cnts} sections with D104556 work this way. In the notation below, the address differences are in-memory.

raw profile header:
  CountersDelta = &D0 - &C0

D0:
  CounterPtr = (in-memory address difference) &D0 - &C0
D1:
  CounterPtr = (in-memory address difference) &D1 - &C1

C0: ...
C1: ...

One can subtract CountersDelta from CounterPtr to get the on-disk D/C offset.

D104556 doesn't touch the FunctionPointer (absolute address) field in D.

Use the ELF::SHF_GNU_RETAIN flag for the __llvm_mipmap section so that it doesn't get stripped by --gc-sections. The ELF::SHF_NOTE flag is not necessary.

The problem is that all __llvm_mipmap sections will be retained, even if the referenced text sections are discarded.
According to https://reviews.llvm.org/D96757#2567631 there could be a lot of waste.

ellis added a comment.Jun 28 2021, 4:23 PM

Unfortunately this is not sufficient because we need to find the references once they are extracted out because the address ranges (either in the original binary and the extracted binary) are not valid or can be altered.

I think extraction works but maybe your extraction operation is different?

The __llvm_prf_{data,cnts} sections with D104556 work this way. In the notation below, the address differences are in-memory.

Thanks for working on D104556. I'll have to study this to see if it can be extracted in all the same cases as MIP.

raw profile header:
  CountersDelta = &D0 - &C0

D0:
  CounterPtr = (in-memory address difference) &D0 - &C0
D1:
  CounterPtr = (in-memory address difference) &D1 - &C1

C0: ...
C1: ...

One can subtract CountersDelta from CounterPtr to get the on-disk D/C offset.

D104556 doesn't touch the FunctionPointer (absolute address) field in D.

I haven't looked into the FunctionPointer field, but I assume it generate dynamic relocations, which cannot be extracted from a shared library.

Use the ELF::SHF_GNU_RETAIN flag for the __llvm_mipmap section so that it doesn't get stripped by --gc-sections. The ELF::SHF_NOTE flag is not necessary.

The problem is that all __llvm_mipmap sections will be retained, even if the referenced text sections are discarded.
According to https://reviews.llvm.org/D96757#2567631 there could be a lot of waste.

It appears that only the mipmap header needs to be retained, so I'll update the flags.

ellis updated this revision to Diff 355065.Jun 28 2021, 4:30 PM

Rename llvm flag and only the mipmap section header needs the SHF_GNU_RETAIN flag.

ellis updated this revision to Diff 355070.Jun 28 2021, 4:47 PM

Fix lit tests

ellis updated this revision to Diff 355946.Jul 1 2021, 10:59 AM

The __llvm_mipraw and __llvm_mipmap sections are now in the same group so that they are retained or removed together. This is important when using -ffunction-sections and --gc-sections.

Also, add 64 bit and 32 bit file types so that we can use 64 bit PC relative relocations.

ellis updated this revision to Diff 356058.Jul 1 2021, 5:38 PM

Hoist Raw Section Offset to mipmap header.

MTC added a subscriber: MTC.Sep 9 2021, 11:45 PM

I think it makes total sense to leverage existing PGO and achieve what's needed through extension rather setting up a new, disconnected framework. I've been working with @ellis and @kyulee offline to see how we can satisfies the key requirements within IRPGO framework.

It really comes down to two things: 1) extractable metadata; 2) coarse-grained instrumentation, and these don't warrant reinvent the wheel for every piece in the PGO pipeline. We've come up with a design that we think can achieve both within today's PGO framework, and we will send up a high level RFC soon. Thanks for the feedbacks and discussions.

ellis added a comment.Nov 1 2021, 9:42 AM

I think it makes total sense to leverage existing PGO and achieve what's needed through extension rather setting up a new, disconnected framework. I've been working with @ellis and @kyulee offline to see how we can satisfies the key requirements within IRPGO framework.

It really comes down to two things: 1) extractable metadata; 2) coarse-grained instrumentation, and these don't warrant reinvent the wheel for every piece in the PGO pipeline. We've come up with a design that we think can achieve both within today's PGO framework, and we will send up a high level RFC soon. Thanks for the feedbacks and discussions.

By the way, here is the RFC we put together.
https://groups.google.com/g/llvm-dev/c/r03Z6JoN7d4

ellis abandoned this revision.Dec 16 2021, 4:52 PM