This is an archive of the discontinued LLVM Phabricator instance.

[llvm][CodeGen] Machine Function Splitter
ClosedPublic

Authored by snehasish on Aug 5 2020, 3:41 PM.

Details

Summary

We introduce a codegen optimization pass which splits functions into hot and cold
parts. This pass leverages the basic block sections feature recently
introduced in LLVM from the Propeller project. The pass targets
functions with profile coverage, identifies cold blocks and moves them
to a separate section. The linker groups all cold blocks across
functions together, decreasing fragmentation and improving icache and
itlb utilization.

We evaluated the Machine Function Splitter pass on clang bootstrap and SPECInt 2017.
For clang bootstrap we observe a mean 2.33% runtime improvement with a
~32% reduction in itlb and stlb misses. Additionally, l1 icache misses
reduced by 9.5% while l2 instruction misses reduced by 20%.
For SPECInt we report the change in IntRate the C/C++
benchmarks. All benchmarks apart from mcf and x264 improve, on average
by 0.6% with the max for deepsjeng at 1.6%.

Benchmark               % Change (IntRate)
500.perlbench_r          0.78
502.gcc_r                0.82
505.mcf_r               -0.30
520.omnetpp_r            0.18
523.xalancbmk_r          0.37
525.x264_r              -0.46
531.deepsjeng_r          1.61
541.leela_r              0.83
557.xz_r                 0.15

Diff Detail

Event Timeline

snehasish created this revision.Aug 5 2020, 3:41 PM
Herald added a project: Restricted Project. · View Herald TranscriptAug 5 2020, 3:41 PM
snehasish requested review of this revision.Aug 5 2020, 3:42 PM
tmsriram added inline comments.Aug 5 2020, 3:55 PM
llvm/lib/CodeGen/BBSectionsPrepare.cpp
72

Should we rename the .h and .cpp so that they have the same prefix? Maybe BasicBlockSections.h and BasicBlockSections.cpp?

llvm/lib/CodeGen/TargetPassConfig.cpp
218

Why not call this split-machine-functions too for consistency?

llvm/test/CodeGen/X86/machine-function-splitter.ll
8

Also, check if the block is moved to the cold region has the expected call instruction?

We probably need to discuss how to make basic-block-section stuff work on non-X86 targets at some point, but I guess we don't have to do it in this patch if it's off by default.

I am wondering what is is your opinion on machine unroller/reroller? Aggressive loop unrolling may destroy code cache too.

Regarding reroller -- compiler with PGO will adjust the agressiveness of the unroller based on instruction workset size estimation. Doing this in later pass or in Propeller can help catch cases that are mis-handled.

davidxl added inline comments.Aug 5 2020, 5:23 PM
llvm/lib/CodeGen/MachineFunctionSplitter.cpp
89

Add an internal option here (the coldness threshold) for experimental purpose.

I also suggest add an option to specify programSummary based coldness threshold such as 99.99 percentile coldness. The default cutoff is 99.9999% defined in ProfileSummaryInfo.cpp: ProfileSummaryCutoffCold

dmajor added a subscriber: dmajor.Aug 5 2020, 6:12 PM

Please share performance numbers for publicly available workload(s).

llvm/lib/CodeGen/MachineFunctionSplitter.cpp
77

Do we need to renumber?

tmsriram added inline comments.Aug 6 2020, 8:21 PM
llvm/lib/CodeGen/MachineFunctionSplitter.cpp
77

Renumbering makes the sorting easy. The sorting will preserve the basic block order for the blocks that are not split.

snehasish updated this revision to Diff 283979.Aug 7 2020, 12:12 PM

Updated diff based on review comments.

  • Added two mllvm options to control cold count and threshold based split.
  • Added tests for the new options.
  • Updated test to check for the absence of unexpected blocks.
  • Renamed BBSectionsPrepare pass and rebased this diff on the change.
snehasish edited the summary of this revision. (Show Details)Aug 7 2020, 12:14 PM
snehasish edited the summary of this revision. (Show Details)
snehasish edited the summary of this revision. (Show Details)Aug 7 2020, 12:17 PM
snehasish marked 5 inline comments as done.Aug 7 2020, 12:24 PM
snehasish added inline comments.
llvm/lib/CodeGen/MachineFunctionSplitter.cpp
77

We need to ensure that the order is preserved so that we don't perturb the decisions made by prior passes such as MachineBlockPlacement. Renumbering simplifies the code that needs to be shared with the BasicBlockSections pass.

llvm/lib/CodeGen/TargetPassConfig.cpp
218

We can't register two options with the same string, i.e. "split-machine-functions".

snehasish updated this revision to Diff 283991.Aug 7 2020, 12:29 PM
snehasish edited the summary of this revision. (Show Details)

Simplify the cold count check.

hiraditya added inline comments.Aug 7 2020, 5:23 PM
llvm/lib/CodeGen/MachineFunctionSplitter.cpp
77

I see we are renumbering both in MachineFunctionSplitter and BasicBlockSections

hiraditya added a comment.EditedAug 7 2020, 5:27 PM

We evaluated the Machine Function Splitter pass on clang bootstrap and SPECInt 2017.

Could you share the details of the machine as well? The improvements are well within noise.

For clang bootstrap we observe a mean 2.33% runtime improvement with a
~32% reduction in itlb and stlb misses

While itlb reduction looks quite impressive, it doesn't seem to translate quite well to the runtime improvement. Did we see consistent >2% improvement with multiple runs? Please share the numbers.

tmsriram added inline comments.Aug 7 2020, 5:32 PM
llvm/lib/CodeGen/MachineFunctionSplitter.cpp
77

The "bbsections-prepare" pass and the machine function splitter pass are intentionally made mutually exclusive. If bbsections is explicitly requested, machine function splitter does not apply. Please see the change in TargetPassConfig.cpp

Could you share the details of the machine as well?

Sure, these were measured on a Lenovo P920 workstation -- Intel Skylake based Xeon(R) Gold 6154 CPU.

The improvements are well within noise.

For SPEC, the reported intrate improvement numbers are an average across 5 iterations. Note that SPEC binaries are tiny in size may only improve code locality in some cases.

While itlb reduction looks quite impressive, it doesn't seem to translate quite well to the runtime improvement.

It stands to reason that removing the itlb bottleneck will expose the next one :) We could dig deeper by looking into how the top down profile changes with and without splitting.

Did we see consistent >2% improvement with multiple runs? Please share the numbers.

We see consistent 2%+ improvements over FDO optimized binaries. The numbers reported are averaged across 10 runs, here is the data for one such experiment where 500 invocations of clang were executed and the overall end to end user time was measured. For completeness, I have included the data for a hot-cold-split optimized binary as well. Note this particular experiment does not use ThinLTO for any of the builds since I had some trouble running the hot-cold-split pass with ThinLTO enabled.

|----------------------------------|----------------------|----------------|-----------|
|                                  | User time in seconds ($ time run-commands.sh)     |
|----------------------------------|----------------------|----------------|-----------|
| Run #                            | FDO baseline         | Hot cold split | MFS       |
|                                1 |               484.65 |          479.2 |    466.93 |
|                                2 |                483.4 |         478.28 |    470.25 |
|                                3 |               485.57 |         479.15 |    470.36 |
|                                4 |               480.37 |         480.34 |    469.85 |
|                                5 |               482.97 |         478.18 |    471.93 |
|                                6 |               484.06 |         479.74 |    473.27 |
|                                7 |               482.67 |         477.42 |    472.56 |
|                                8 |               483.53 |         476.99 |    474.58 |
|                                9 |               486.43 |         480.76 |    473.92 |
|                               10 |               489.94 |         480.11 |    471.42 |
|----------------------------------|----------------------|----------------|-----------|
| 2 Tail Paired T-Test vs Baseline |                      |      0.0000636 | 0.0000006 |
|----------------------------------|----------------------|----------------|-----------|
| Average                          |              484.359 |        479.017 |   471.507 |
|----------------------------------|----------------------|----------------|-----------|
| % Change                         |                      |           1.10 |      2.65 |
|----------------------------------|----------------------|----------------|-----------|

Here is the data for TLB and icache. Each event was collected independently along with instructions to ensure no multiplexing. The variance reported by perf was less than 1% for each event (often less than 0.5%).

|-----------|--------------------------------------------|--------------------------------------------------------|
|           | $ perf stat -r 3 -e frontend_retired.${EVENT}:u,instructions:u -- run-commands.sh                   |
|-----------|--------------------------------------------|--------------------------------------------------------|
|           | Machine Function Splitter                  | FDO Baseline                                           |
|-----------|--------------------------------------------|--------------------------------------------------------|
| EVENT     | Misses        | Instructions      | MPKI   | Misses         | Instructions      | MPKI   | % Change |
| itlb_miss | 1,411,325,040 | 1,618,495,692,919 | 0.8720 |  2,066,003,373 | 1,618,097,715,534 | 1.2768 |    31.70 |
| stlb_miss |   131,949,440 | 1,618,466,757,079 | 0.0815 |    195,471,938 | 1,618,061,281,016 | 0.1208 |    32.51 |
| l1i_miss  | 9,678,255,804 | 1,618,479,987,914 | 5.9798 | 10,698,143,090 | 1,618,081,273,918 | 6.6116 |     9.56 |
| l2_miss   |   434,287,963 | 1,618,443,723,597 | 0.2683 |    542,869,835 | 1,618,081,904,973 | 0.3355 |    20.02 |
|-----------|--------------------------------------------|--------------------------------------------------------|
snehasish updated this revision to Diff 284169.Aug 8 2020, 11:13 PM

Update PSI metadata to fix assert failure.

hiraditya added a subscriber: rjf.Aug 9 2020, 5:03 PM

Thanks for adding the results, could you share the script to measure bootstrap numbers?

In HCS the ability to keep cold functions in a separate section was added in: D85331 (cc: @rjf ), can we try with -mllvm -enable-cold-section to compare with MachineFuncionSplitter.

Thanks for adding the results, could you share the script to measure bootstrap numbers?

I've uploaded a Makefile here which will allow you to run the bootstrap benchmarks. Applying this patch on a local llvm repo and pointing the Makefile at it should be sufficient to get you going.

In HCS the ability to keep cold functions in a separate section was added in: D85331 (cc: @rjf ), can we try with -mllvm -enable-cold-section to compare with MachineFuncionSplitter.

We already incorporate this in our evaluation since we link using lld along with the flag -z,keep-text-section-prefix. Since the extracted functions are marked cold, they are assigned a .text.unlikely prefix. Passing -z,keep-text-section-prefix to lld ensures that these functions are placed in the appropriate output section achieving the same goal of improving locality for hot code. The impact of this can be seen in the binary characteristics we shared in the original RFC which showed a 41% and 47% decrease in size of .text and .text.hot respectively for the hot cold split pass.

We applied patch D85331 and find similar results. Comparing the sections of the binary (hot cold split vs PGO baseline), we find a new __llvm_cold section along with similar fractions of code extracted from .text and .text.hot --

   FILE SIZE        VM SIZE    
--------------  -------------- 
 [NEW] +7.31Mi  [NEW] +7.31Mi    __llvm_cold
  +64% +3.21Mi   +64% +3.21Mi    .eh_frame
  +26% +2.77Mi  [ = ]       0    .strtab
  +27%  +711Ki  [ = ]       0    .symtab
  +31%  +236Ki   +31%  +236Ki    .eh_frame_hdr
 +3.1%     +12  [ = ]       0    .shstrtab
  +29%      +9   +29%      +9    [LOAD #3 [RX]]
 +0.8%      +6  +0.8%      +6    [LOAD #2 [R]]
 -7.1%      -1  [ = ]       0    [Unmapped]
 -0.0%    -246  -0.0%    -246    .dynstr
 -0.1% -4.05Ki  -0.1% -4.05Ki    .rodata
 -1.4% -5.73Ki  -1.4% -5.73Ki    .text.startup
 -1.6%  -536Ki  -1.6%  -536Ki    .text.unlikely
-46.7% -2.51Mi -46.7% -2.51Mi    .text.hot
-43.8% -2.89Mi -43.8% -2.89Mi    .text
  +10% +8.28Mi  +7.2% +4.82Mi    TOTAL

Running the benchmarks with the patch enabled and comparing against the MachineFunctionSplitter. We see similar performance numbers --

|----------|----------|----------------|---------------------------|
| Run #    | Baseline | Hot Cold Split | Machine Function Splitter |
|----------|----------|----------------|---------------------------|
|        1 |   501.25 |         490.84 |                    489.48 |
|        2 |   504.22 |         491.66 |                    493.42 |
|        3 |   500.04 |          492.7 |                    489.18 |
|        4 |    499.4 |         493.31 |                    489.47 |
|        5 |   495.62 |          496.1 |                    488.79 |
|        6 |   500.62 |         495.61 |                    488.41 |
|        7 |   501.81 |         494.45 |                    487.67 |
|        8 |   496.96 |         495.91 |                    490.91 |
|        9 |   500.22 |         497.17 |                       488 |
|       10 |   499.66 |         493.81 |                    489.65 |
|----------|----------|----------------|---------------------------|
| Average  |   499.98 |        494.156 |                   489.498 |
| % Change |          |           1.16 |                      2.10 |
|----------|----------|----------------|---------------------------|
snehasish updated this revision to Diff 284560.Aug 10 2020, 7:21 PM

Add an option to exclude specific sections.

  • Add option -mfs-excluded-sections to allow users to specify section names to exclude.
  • Add a test for the option.

Overall approach looks good to me even when we don't see good SPEC-17 numbers as the optimization is intended to reduce page faults. The improvements would be more pronounced in large applications. I'll review the code in more detail in the next few days. Thanks for working on this.

hiraditya requested changes to this revision.Aug 14 2020, 12:10 AM

please run clang-format on the patch.

llvm/lib/CodeGen/BasicBlockSections.cpp
232 ↗(On Diff #284560)

nit: tab?

277 ↗(On Diff #284560)

Is this comment necessary?

This revision now requires changes to proceed.Aug 14 2020, 12:10 AM

Can we add more test cases to include eh_pad, invokeinst,

llvm/lib/CodeGen/MachineFunctionSplitter.cpp
114

std::find?

tmsriram added inline comments.Aug 18 2020, 12:35 PM
llvm/lib/CodeGen/MachineFunctionSplitter.cpp
112

I think excluding sections needs a bit more thought and we should do this as a separate patch if it is useful but I think a linker solution would be more favorable.

  1. From what I understand, when a user specifies section names using the section keyword, then the expectation is that all functions marked with that section name will be grouped together. With function splitting, since you attach the ".cold" suffix to such sections that are split, there is no guarantee that the linker will place them together as these are not prefixed as ".text".
  1. To overcome the above problem, the option to exclude such sections from being split is not ideal either as it moves the burden to the user to get this right with appropriate options.
  1. I think the temporary fix is to not split sections which are not prefixed as ".text". You can add a "FIXME:" comment here to describe why you are doing this. Moving forward, we can look at a linker solution where '.' is treated as a valid section name separator and sections with identical prefix before the "." are always grouped together even if they are not named ".text".
  1. I think we can move this handling as an enhancement in another patch.
tmsriram added inline comments.Aug 18 2020, 12:37 PM
llvm/lib/CodeGen/MachineFunctionSplitter.cpp
112

Correction: I meant ".unlikely" and not ".cold".

efriedma added inline comments.Aug 18 2020, 1:02 PM
llvm/lib/CodeGen/MachineFunctionSplitter.cpp
112

I think I'd rather make splitting for functions with an explicit section attribute opt-in, rather than opt-out. The user might have a strong need to emit a function in a particular section (for example, if the name is mentioned in a linker script). If someone is messing with section attributes in the first place, I'd like to be conservative by default.

llvm/lib/CodeGen/TargetPassConfig.cpp
218

Why do we need two options to control the same thing?

snehasish updated this revision to Diff 286459.Aug 18 2020, 7:31 PM

Address review comments.

  • Remove excluded section attribute option.
  • Don't split functions which have a section attribute.
  • Add a test for ehpads.
  • Remove redundant comments and clang-format.
snehasish marked an inline comment as done.Aug 18 2020, 8:09 PM

Thanks for the reviews!
@hiraditya I've added a test to ensure ehpads are not split out. Let me know if there are additional cases you want to cover.

llvm/lib/CodeGen/BasicBlockSections.cpp
232 ↗(On Diff #284560)

This seems to be the output from clang-format. The diff does not have a tab but it looks like phabricator is showing it as one?

llvm/lib/CodeGen/MachineFunctionSplitter.cpp
112

@tmsriram I think you meant to exclude such functions from being split. I agree, taking into consideration @efriedma's comment, I've removed the option and made this a conservative check on the section attribute. Any function with the section attribute set is not split.

114

Code referencing this comment was removed.

llvm/lib/CodeGen/TargetPassConfig.cpp
218

In this patch we added two options

  1. An option in llvm/lib/CodeGen/CommandFlags.cpp "split-machine-functions" so that llc can be used to invoke it in the tests.
  2. We added a temporary option in llvm/lib/CodeGen/TargetPassConfig.cpp so that it can be invoked when running with clang or lld (for LTO).

AFAICT we cant use (2) for tests and having (1) makes it easy to compile things without an intermediate llc step. We plan on removing (2) in a future patch which will add appropriate options to clang (-fsplit-machine-functions) and lld (--lto-split-machine-functions).

snehasish added inline comments.Aug 18 2020, 9:33 PM
llvm/lib/CodeGen/TargetPassConfig.cpp
218

Correction: We can use (2) for tests by passing "-enable-split-machine-functions" to llc however since we plan to introduce clang and lld flags in the near future it seems cleaner to leave the llc flag in place and just remove (2) when that happens rather than reintroduce it. WDYT?

efriedma added inline comments.
llvm/lib/CodeGen/TargetPassConfig.cpp
218

clang doesn't call RegisterCodeGenFlags? That seems like something we should consider changing.

snehasish added inline comments.Aug 20 2020, 1:05 PM
llvm/lib/CodeGen/TargetPassConfig.cpp
218

The clang driver does not register the codegen flags, the only clang tool which does is clang-fuzzer. A small patch like the one below would do the trick for basic functionality. More plumbing might be needed to print the appropriate flags from the driver. I think this is probably worth more discussion and beyond the scope of this patch.

diff --git a/clang/tools/driver/cc1as_main.cpp b/clang/tools/driver/cc1as_main.cpp
index 87047be3c2b..0b9b5673d3e 100644
--- a/clang/tools/driver/cc1as_main.cpp
+++ b/clang/tools/driver/cc1as_main.cpp
@@ -21,6 +21,7 @@
 #include "llvm/ADT/STLExtras.h"
 #include "llvm/ADT/StringSwitch.h"
 #include "llvm/ADT/Triple.h"
+#include "llvm/CodeGen/CommandFlags.h"
 #include "llvm/IR/DataLayout.h"
 #include "llvm/MC/MCAsmBackend.h"
 #include "llvm/MC/MCAsmInfo.h"
@@ -61,6 +62,8 @@ using namespace clang::driver::options;
 using namespace llvm;
 using namespace llvm::opt;
 
+static codegen::RegisterCodeGenFlags CGF;
+
 namespace {
 
 /// Helper class for representing a single invocation of the assembler.
tmsriram accepted this revision.Aug 24 2020, 3:38 PM

LGTM, thanks for doing this!

llvm/lib/CodeGen/MachineFunctionSplitter.cpp
110

Maybe a comment here that this is useful while sorting the blocks?

118

Do we need a fixme comment here to say that we could split out landing pads if the exception patch for bb sections lands?

snehasish updated this revision to Diff 287785.Aug 25 2020, 3:41 PM

Add a comment to explain renumbering, FIXME for ehpads.

  • Added a comment to explain why renumbering blocks is necessary.
  • Added a FIXME and pointer to exceptions splitting patch.
snehasish marked 2 inline comments as done.Aug 25 2020, 3:52 PM

Thanks for the comments.
@efriedma @hiraditya - Let me know if you have any further comments, I will wait till EOD Thursday 08/26. If not I'll take that as go ahead to commit this change.

hiraditya added inline comments.Aug 28 2020, 8:43 AM
llvm/lib/CodeGen/MachineFunctionSplitter.cpp
124

nit: redundant braces,

llvm/lib/CodeGen/TargetPassConfig.cpp
216

Remove this FIXME. most passes continue to have cl::opt anyways.

nit: If FIXME's are mostly future works, then please replace them with TODOs.

Address reviewer comments.

  • Remove redundant braces.
  • Remove FIXME for cl::opt.
  • s/FIXME/TODO for future work.
snehasish marked 2 inline comments as done.Aug 28 2020, 10:04 AM

Thanks for the comments all! The builds look green and I'm going to go ahead and push this.

This revision was not accepted when it landed; it landed in state Needs Review.Aug 28 2020, 11:13 AM
This revision was automatically updated to reflect the committed changes.