Page MenuHomePhabricator

[InstrProf] Minimal Block Coverage
Needs ReviewPublic

Authored by ellis on Apr 26 2022, 5:14 PM.

Details

Summary

This diff implements minimal block coverage instrumentation. When the -pgo-block-coverage option is used, basic blocks will be instrumented for block coverage using single byte booleans. The coverage of some basic blocks can be inferred from others, so not every basic block is instrumented. In fact, we found that only ~60% of basic blocks need to be instrumented. These differences lead to less size overhead when compared to instrumenting block counts. For example, block coverage on the clang binary has an overhead of 20 Mi (17%) compared to 56 Mi (47%) with block counts.

Even though block coverage profiles have less precision than block count profiles, they can still be used to guide optimizations. In PGOUseFunc we use block coverage to populate edge weights such that BFI gives nonzero counts to only covered blocks. We do this by 1) setting the entry count of covered functions to a large value, i.e., 10000 and 2) populating edge weights using block coverage. In the next diff https://reviews.llvm.org/D125743 we use BFI to guide the machine outliner to avoid outlining covered blocks. This -pgo-block-coverage option provides a trade off of generating less precise profiles for faster and smaller instrumented binaries.

The BlockCoverageInference class defines the algorithm to find the minimal set of basic blocks that need to be instrumented for coverage. This is different from the Kirchhoff circuit law optimization that is used for edge counts because that does not work for block coverage. The reason for this is that edge counts can be added together to find a missing count while block coverage cannot since they store boolean values. So we need a new algorithm to find which blocks must be instrumented.

The details on this algorithm can be found in this paper titled "Minimum Coverage Instrumentation": https://arxiv.org/abs/2208.13907

Special thanks to Julian Mestre for creating this block coverage inference algorithm.

Binary size of clang using -O2:

  • Base
    • .text: 65.8 Mi
    • Total: 119 Mi
  • IRPGO (-fprofile-generate -mllvm -disable-vp -mllvm -debug-info-correlate)
    • .text: 93.0 Mi
    • __llvm_prf_cnts: 14.5 Mi
    • Total: 175 Mi
  • Minimal Block Coverage (-fprofile-generate -mllvm -disable-vp -mllvm -debug-info-correlate -mllvm -pgo-block-coverage)
    • .text: 82.1 Mi
    • __llvm_prf_cnts: 1.38 Mi
    • Total: 139 Mi

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
This revision now requires review to proceed.Jun 2 2022, 4:51 PM
ellis added inline comments.Jun 3 2022, 4:08 PM
llvm/lib/Transforms/Instrumentation/BlockCoverageInference.cpp
180

I don't know if this is optional or not, so I'm open to either way.

I don't expect to ever hit this assert unless there is a bug, so a user should not encounter this assert. If the assert is hit, then we will need to look at and understand the source anyway and I'm not sure how much added value we get from a readable message here.

ellis updated this revision to Diff 434191.Jun 3 2022, 4:09 PM

Improve comments, add more statistics, and remove the get() helper function.

(I plan to read this soon, and adding myself as a blocking reviewer.)

ellis added a comment.Jun 3 2022, 4:11 PM

(I plan to read this soon, and adding myself as a blocking reviewer.)

Ah that makes sense, thanks!

ellis added inline comments.Jun 3 2022, 4:53 PM
llvm/lib/Transforms/Instrumentation/BlockCoverageInference.cpp
222

The predecessor was just an arbitrary choice.

On second thought, this logic is not really needed. The original idea was to have every non-instrumented node depend on either some of its predecessors or some of its successors. This aligns well with the theory, but in practice we don't think this is needed. This may result in a node's coverage depending on more nodes than necessary, but this should be fine. I will remove this logic.

ellis updated this revision to Diff 434207.Jun 3 2022, 4:56 PM

Remove logic that enforced a non-instrumented node to either depend on its predecessors or successors, but not both. This logic was valid, but not necessary to produce a valid coverage instrumentation, so it was removed for simplicity.

Also, copy the interesting test case from compiler-rt to llvm. This allows us to more easily debug the complex test case in LLVM rather than compiler-rt.

ellis marked an inline comment as done.Jun 3 2022, 4:56 PM
MaskRay added inline comments.Jun 4 2022, 11:15 AM
llvm/include/llvm/Transforms/Instrumentation/BlockCoverageInference.h
16

LLVM_TRANSFORMS_INSTRUMENTATION_BLOCKCOVERAGEINFERENCE_H

llvm/lib/Transforms/Instrumentation/BlockCoverageInference.cpp
49
51

Use a reference for non-null.

delete assert(F)

58
llvm/lib/Transforms/Instrumentation/PGOInstrumentation.cpp
275

Delete , cl::init(false)

If this new instrumentation mode is for coverage (non-optimization), why modify PGOUseFunc? If just to use PGOVerifyBFI, it doesn't justify modifying PGOInstrumentation. It would be cleaner to create a new file based on my current understanding.

llvm/include/llvm/Transforms/Instrumentation/BlockCoverageInference.h
80
llvm/lib/Transforms/Instrumentation/BlockCoverageInference.cpp
67
104
114

return F.size() == Visited.size()

121
130

BlockList is used only once. Remove using

132

This loop can be augmented with more info to include canFindMinimalCoverage, then just delete canFindMinimalCoverage

138

The time complexity is O(|BB|^2) and can be problematic.

146

llvm::any_of

200

return Path.count(Neighbors[0]) ? Neighbors[1] : Neighbors[0];

245

Use llvm::append_range

llvm/lib/Transforms/Instrumentation/PGOInstrumentation.cpp
1656

This can use a flood-fill instead of a fixed-point iteration.

MaskRay requested changes to this revision.Jun 4 2022, 12:12 PM
This revision now requires changes to proceed.Jun 4 2022, 12:12 PM
MaskRay added inline comments.Jun 4 2022, 12:44 PM
llvm/lib/Transforms/Instrumentation/BlockCoverageInference.cpp
150

I have other things to do and haven't read very carefully, but the impl here appears to differ from the algorithm description in the summary.

Having a SuperReachable Pred does not mean that PredecessorDependencies[&BB] should remain empty.

MaskRay added inline comments.Jun 4 2022, 12:50 PM
llvm/lib/Transforms/Instrumentation/PGOInstrumentation.cpp
275

Also delete cl::ZeroOrMore. I have changed cl::opt that the option can be specified multiple times without an error.

ellis added a comment.Jun 6 2022, 11:30 AM

If this new instrumentation mode is for coverage (non-optimization), why modify PGOUseFunc? If just to use PGOVerifyBFI, it doesn't justify modifying PGOInstrumentation. It would be cleaner to create a new file based on my current understanding.

Perhaps I should update the description to better explain how this diff is used for optimization. I will update it shortly.

In PGOUseFunc we populate edge weights such that BlockFrequencyInfo reports uncovered blocks as cold and covered blocks as hot. Obviously this will not be as precise as populating real edge weights from block counts, but this is a trade off from using extremely lightweight instrumented binaries. In the next diff D125743 we can take advantage of these imprecise profiles by using block coverage to guide the machine outliner. We have found this to work quite well at reducing binary size without harming runtime performance.

ellis edited the summary of this revision. (Show Details)Jun 6 2022, 3:28 PM
ellis updated this revision to Diff 435403.Jun 8 2022, 6:58 PM
ellis marked 15 inline comments as done.

Make changes based on comments.

ellis planned changes to this revision.Jun 8 2022, 6:59 PM

I am planning changes for now. We might have a simpler algorithm that uses dominator trees. It also runs in linear time, so it's more efficient than the current one.

llvm/lib/Transforms/Instrumentation/PGOInstrumentation.cpp
275

Nice 🙂

ellis updated this revision to Diff 459908.Sep 13 2022, 3:58 PM

Fix warning with Optional<BlockCoverageInference>

ellis edited the summary of this revision. (Show Details)Sep 13 2022, 4:10 PM
ellis added a reviewer: yozhu.
ellis updated this revision to Diff 459914.Sep 13 2022, 4:13 PM

Add link to paper

jmestre added inline comments.
llvm/lib/Transforms/Instrumentation/BlockCoverageInference.cpp
150

As long as BB itself is reachable from EntryBlock, PredecessorDependencies should be non-empty whenever HasSuperReachablePred is true. Consider a simple path (i.e., there are no repeated nodes in the path) from EntryBlock to BB. Let Pred be the second to last node in this path. (Pred is well defined because HasSuperReachablePred is true implies BB != EntryBlock and so the path has at least one edge.) Then Pred belongs in RecheableFromEntry and so it will be added to PredecessorDependencies.
If it is desirable that the implementation handles functions with blocks that are not reachable from EntryBlock, then we can filter them out at the start and set their coverage status of these block to false right way.

ellis updated this revision to Diff 466673.Oct 10 2022, 5:56 PM

Rebase and add comment discussing the runtime.

ellis marked an inline comment as done.Oct 10 2022, 5:59 PM

Can we get another review of this? CC @MaskRay

llvm/lib/Transforms/Instrumentation/BlockCoverageInference.cpp
138

I've added a comment discussing the runtime. We do know of a linear time algorithm that would compute the Dependencies maps, but it is significantly more complicated. We thought it would be best to first add this simpler implementation and in the future we can use the linear algorithm if we find that it gives a significant speedup in practice.

ellis added a comment.Oct 17 2022, 9:54 AM

Friendly ping :)

ellis updated this revision to Diff 475237.Nov 14 2022, 11:59 AM

Rebase.

llvm/lib/Transforms/Instrumentation/PGOInstrumentation.cpp
1656

I opted to use this algorithm for simplicity. Since most CFGs are relatively small, this function doesn't contribute too much to buildtime. If we discover that this causes a slowdown in some real world scenario then we can improve it. In fact, we can use info from the block coverage inference algorithm to infer all blocks in one pass. But again, this would add complexity and I would rather not do this unless we needed to.

@MaskRay, were you still planning to review this/do you have any unaddressed concerns?

ellis updated this revision to Diff 490914.Jan 20 2023, 11:11 AM

Rebase, use the std::optional() API, and initialize region counters for cover instructions as well as increment instructions.

smeenai added a comment.EditedWed, Feb 22, 5:37 PM

Ping @MaskRay. Have your blocking concerns been addressed?

This comment was removed by smeenai.

Ping @MaskRay. Have your blocking concerns been addressed?

Very sorry for missing these notifications! I'll read this patch soon...

Ping @MaskRay. Have your blocking concerns been addressed?

Very sorry for missing these notifications! I'll read this patch soon...

Thank you!

In BlockCoverageInference.cpp , for (auto &BB : F) { ... getReachableAvoiding is strictly quadratic and I think that may be problematic.
There are quite a few programs (e.g. Chisel genrerated C++) which contain functions with many basic blocks (say, >1000).
Also see D137184: a workaround for some programs with too many critical edges.

Applying a strict quadratic algorithm is a compile time pitfall. I understand that it may be fine for some mobile applications.
There are certainly other quadratic algorithms such as an O((V+E)*c) algorithm in GCC gcov (V: number of vertices; E: number of edges; c: number of elementary circuits),
an O(V^2) implementation of Semi-NCA algorithm.
For them, the quadratic time complexity is a theoretic upper bound which is really really difficult to achieve in practice (if ever achievable considering that in practice most CFGs are reducible).

Note that the optimality in the number of instrumented basic blocks is not required.
It does not necessarily lead to optimal performance since we have discarded execution count information.
Is it possible to use a faster algorithm which may instrument a few more basic blocks (let's arbitrarily say 1.2-approximation is good enough in practice)?

llvm/lib/Transforms/Instrumentation/BlockCoverageInference.cpp
112

Rename ExitBlocks: BB may end with an unreachable/resume/cleanupret/etc. I think we don't name them exit blocks.

138

We have quadratic time algorithms like O((V+E)(c+1)) used in gcov.

llvm/lib/Transforms/Instrumentation/PGOInstrumentation.cpp
1656

I think it's worth fixing it. A flood-fill algorithm is nearly of the same length and avoids the performance pitfall. That style is used much more than the iterative algorithm here.

1718
ellis updated this revision to Diff 501634.Wed, Mar 1, 12:37 PM
ellis marked an inline comment as done.

Use flood-fill algorithm to infer block coverage and rename some compiler-rt tests.

llvm/lib/Transforms/Instrumentation/BlockCoverageInference.cpp
112

Is TerminalBlocks a better name?

llvm/lib/Transforms/Instrumentation/PGOInstrumentation.cpp
1656

I've changes this to a flood-fill algorithm, but I needed to first compute InverseDependencies which is done in one pass.

ellis added inline comments.Wed, Mar 1, 12:43 PM
llvm/lib/Transforms/Instrumentation/BlockCoverageInference.cpp
138

I haven't forgotten about this, I just wanted to resolve the other comments first.

ellis added a comment.Tue, Mar 14, 5:53 PM

In BlockCoverageInference.cpp , for (auto &BB : F) { ... getReachableAvoiding is strictly quadratic and I think that may be problematic.
There are quite a few programs (e.g. Chisel genrerated C++) which contain functions with many basic blocks (say, >1000).
Also see D137184: a workaround for some programs with too many critical edges.

Applying a strict quadratic algorithm is a compile time pitfall. I understand that it may be fine for some mobile applications.
There are certainly other quadratic algorithms such as an O((V+E)*c) algorithm in GCC gcov (V: number of vertices; E: number of edges; c: number of elementary circuits),
an O(V^2) implementation of Semi-NCA algorithm.
For them, the quadratic time complexity is a theoretic upper bound which is really really difficult to achieve in practice (if ever achievable considering that in practice most CFGs are reducible).

Note that the optimality in the number of instrumented basic blocks is not required.
It does not necessarily lead to optimal performance since we have discarded execution count information.
Is it possible to use a faster algorithm which may instrument a few more basic blocks (let's arbitrarily say 1.2-approximation is good enough in practice)?

We have analyzed the runtime on several real-world programs that have large functions with many basic blocks. We found that the total runtime of the pgo-instr-gen pass for functions with less than 1.5K blocks is less than 5 seconds. For functions with between 4K and 10K basic blocks, the runtime was 150 seconds to 850 seconds.

Note that the intended use case is to instrument minimal block coverage on binaries interested in minimizing binary size so that we can better control the machine outliner. Those binaries likely won't have functions this large because inlining is likely restricted to reduce code size. For those special cases where functions are large (say, >1.5K blocks), I'd like to bail on this algorithm and instead instrument every block. This will keep the code complexity down while preventing build time regressions for the pathological cases.

ellis updated this revision to Diff 505341.Tue, Mar 14, 5:54 PM

Bail if the function has >1.5K blocks.

ellis added a comment.Wed, Mar 22, 5:13 PM

Hi @MaskRay do you have any more concerns?