This is an archive of the discontinued LLVM Phabricator instance.

[InstrProf] Minimal Block Coverage
ClosedPublic

Authored by ellis on Apr 26 2022, 5:14 PM.

Details

Summary

This diff implements minimal block coverage instrumentation. When the -pgo-block-coverage option is used, basic blocks will be instrumented for block coverage using single byte booleans. The coverage of some basic blocks can be inferred from others, so not every basic block is instrumented. In fact, we found that only ~60% of basic blocks need to be instrumented. This leads to less size overhead compared to instrumenting block counts. For example, block coverage on the clang binary has an overhead of 20 Mi (17%) compared to 56 Mi (47%) with block counts.
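As a rough sketch of the difference in instrumentation cost (the names and layout below are illustrative, not the actual lowering the compiler emits), a block counter is an 8-byte slot that is incremented on every execution, while block coverage is a 1-byte slot that is only ever set to one:

  // Simplified sketch, not the code emitted by the compiler.
  #include <cstdint>

  constexpr unsigned NumSlots = 8;
  uint64_t Counters[NumSlots]; // block-count mode: 8 bytes per block, incremented
  uint8_t Coverage[NumSlots];  // block-coverage mode: 1 byte per block, set once

  void onBlockEntry(unsigned Slot) {
    Counters[Slot] += 1; // counts how many times the block executed
    Coverage[Slot] = 1;  // only records that the block executed at all
  }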

Even though block coverage profiles have less precision than block count profiles, they can still be used to guide optimizations. In PGOUseFunc we use block coverage to populate edge weights such that BFI gives nonzero counts to only covered blocks. We do this by 1) setting the entry count of covered functions to a large value, i.e., 10000, and 2) populating edge weights using block coverage. In the next diff https://reviews.llvm.org/D125743 we use BFI to guide the machine outliner to avoid outlining covered blocks. This -pgo-block-coverage option provides a trade-off: less precise profiles in exchange for faster and smaller instrumented binaries.
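A minimal sketch of the intent in plain C++ (assumed helper names, not the PGOUseFunc code): the covered function gets a large synthetic entry count, and each branch gets weights that push all of the frequency toward covered successors, so BFI reports nonzero counts only for covered blocks:

  // Illustrative only; the real pass attaches this information as IR metadata.
  #include <cstdint>
  #include <vector>

  constexpr uint64_t CoveredFunctionEntryCount = 10000; // large synthetic entry count

  // One weight per successor: covered successors receive all of the frequency.
  std::vector<uint32_t>
  branchWeightsFromCoverage(const std::vector<bool> &SuccCovered) {
    std::vector<uint32_t> Weights;
    for (bool Covered : SuccCovered)
      Weights.push_back(Covered ? 1 : 0);
    return Weights;
  }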

The BlockCoverageInference class defines the algorithm to find the minimal set of basic blocks that need to be instrumented for coverage. This is different from the Kirchhoff circuit law optimization used for edge counts, which does not work for block coverage: edge counts can be added together to recover a missing count, while boolean coverage values cannot. So we need a new algorithm to find which blocks must be instrumented.
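To make the inference concrete, here is a hand-worked diamond CFG (illustrative only, not the BlockCoverageInference API), assuming control always flows from the entry through exactly one branch to the exit. Instrumenting just the two branch blocks suffices; the other two are inferred:

  // Diamond CFG: Entry -> Then, Entry -> Else, Then -> Exit, Else -> Exit.
  // Only Then and Else carry a coverage byte; Entry and Exit are inferred.
  #include <cstdio>

  int main() {
    // Hypothetical values read from the profile for the instrumented blocks.
    bool ThenCovered = true;
    bool ElseCovered = false;
    // If either branch ran, the function was entered and the exit was reached.
    bool EntryCovered = ThenCovered || ElseCovered;
    bool ExitCovered = ThenCovered || ElseCovered;
    std::printf("Entry=%d Then=%d Else=%d Exit=%d\n", EntryCovered, ThenCovered,
                ElseCovered, ExitCovered);
    return 0;
  }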

The details of this algorithm can be found in the paper "Minimum Coverage Instrumentation": https://arxiv.org/abs/2208.13907

Special thanks to Julian Mestre for creating this block coverage inference algorithm.

Binary size of clang using -O2:

  • Base
    • .text: 65.8 Mi
    • Total: 119 Mi
  • IRPGO (-fprofile-generate -mllvm -disable-vp -mllvm -debug-info-correlate)
    • .text: 93.0 Mi
    • __llvm_prf_cnts: 14.5 Mi
    • Total: 175 Mi
  • Minimal Block Coverage (-fprofile-generate -mllvm -disable-vp -mllvm -debug-info-correlate -mllvm -pgo-block-coverage)
    • .text: 82.1 Mi
    • __llvm_prf_cnts: 1.38 Mi
    • Total: 139 Mi

Diff Detail

Event Timeline

ellis created this revision.Apr 26 2022, 5:14 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 26 2022, 5:14 PM
kyulee added a subscriber: kyulee.Apr 26 2022, 6:53 PM
ellis edited the summary of this revision. (Show Details)Apr 26 2022, 9:58 PM
ellis published this revision for review.May 16 2022, 11:18 AM
ellis edited the summary of this revision. (Show Details)
ellis added reviewers: phosek, spupyrev, kyulee, wenlei, davidxl.
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptMay 16 2022, 11:19 AM
Herald added subscribers: llvm-commits, Restricted Project. · View Herald Transcript

Special thanks to Julian Mestre for creating this block coverage inference algorithm.

I haven't read the code yet but can you briefly describe it in the summary?

How does this compare with the Kirchhoff circuit law optimization, which requires integral execution counts?
Note: -fprofile-instr-generate is quite a bit slower because it does not have the optimization.

ellis edited the summary of this revision. (Show Details)May 16 2022, 12:01 PM

Special thanks to Julian Mestre for creating this block coverage inference algorithm.

I haven't read the code yet but can you briefly describe it in the summary?

How does this compare with the Kirchhoff circuit law optimization, which requires integral execution counts?
Note: -fprofile-instr-generate is quite a bit slower because it does not have the optimization.

I've updated the summary to explain the algorithm.

Do the outliner changes need to be a part of this patch? Or can you split the outliner changes out into a separate patch, and then add that separate patch as a child of this one?

gulfem added a subscriber: gulfem.May 16 2022, 6:32 PM
ellis updated this revision to Diff 429933.May 16 2022, 9:38 PM

Rebase and separate out machine outliner code into https://reviews.llvm.org/D125743

ellis edited the summary of this revision. (Show Details)May 16 2022, 9:39 PM
gulfem added a comment.EditedMay 17 2022, 5:48 PM

Hi @ellis,

This is exciting for us! We are interested in using single byte booleans for source-based code coverage, and I am currently working on the implementation.
We are trying to find a good solution to the problem you described: when we use boolean counters, missing counts cannot be inferred by doing arithmetic on other block counts.
For example, in an if statement the else part is not instrumented, and its counter is inferred by subtracting the then counter from the parent counter.
When we use boolean values for counters, this can no longer be done.
A simpler approach is to instrument more basic blocks, but we are exploring approaches that add counters to a minimal set of blocks.
One of the biggest differences between coverage and PGO is that counters are emitted by traversing AST nodes in the front end for coverage, whereas they are emitted by traversing CFG nodes for PGO.
The algorithm that you introduced is based on CFG traversal, and I'm looking at whether it can be repurposed for coverage.
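A small illustration of that point in plain C++ (not CoverageMappingGen code): with integer counters the else count falls out of simple arithmetic, but the analogous subtraction on booleans tells us nothing:

  #include <cstdint>
  #include <cstdio>

  int main() {
    // Integer counters: the uninstrumented else count can be derived.
    uint64_t ParentCount = 10, ThenCount = 7;
    uint64_t ElseCount = ParentCount - ThenCount; // 3 executions of the else branch
    std::printf("else count = %llu\n", static_cast<unsigned long long>(ElseCount));

    // Boolean counters: knowing the parent and then branches were seen says
    // nothing about whether the else branch was ever taken, so it must be
    // measured directly or inferred some other way.
    bool ParentSeen = true, ThenSeen = true;
    (void)ParentSeen;
    (void)ThenSeen;
    return 0;
  }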

A few questions:

  1. You mentioned that "we found that only ~60% of basic blocks need to be instrumented". With boolean counters, do you know how much that has changed?
  2. Size reduction is great! Do you have any data on the impact of block coverage on compilation time and runtime performance? For coverage, we are also aiming to improve runtime performance with boolean counters.
  3. Did you do any verification of the correctness of block coverage, e.g., by comparing it against block counts?

It seems like we are trying to solve similar problems in different contexts, so we would be very interested in collaboration.

ellis added a comment.May 18 2022, 9:12 AM

Hi @ellis,

This is exciting for us! We are interested in using single byte booleans for source-based code coverage, and I am currently working on the implementation.
We are trying to find a good solution to the problem you described: when we use boolean counters, missing counts cannot be inferred by doing arithmetic on other block counts.
For example, in an if statement the else part is not instrumented, and its counter is inferred by subtracting the then counter from the parent counter.
When we use boolean values for counters, this can no longer be done.
A simpler approach is to instrument more basic blocks, but we are exploring approaches that add counters to a minimal set of blocks.
One of the biggest differences between coverage and PGO is that counters are emitted by traversing AST nodes in the front end for coverage, whereas they are emitted by traversing CFG nodes for PGO.
The algorithm that you introduced is based on CFG traversal, and I'm looking at whether it can be repurposed for coverage.

A few questions:

  1. You mentioned that "we found that only ~60% of basic blocks need to be instrumented". With boolean counters, do you know how much that has changed?
  2. Size reduction is great! Do you have any data on the impact of block coverage on compilation time and runtime performance? For coverage, we are also aiming to improve runtime performance with boolean counters.
  3. Did you do any verification of the correctness of block coverage, e.g., by comparing it against block counts?

It seems like we are trying to solve similar problems in different contexts, so we would be very interested in collaboration.

I'm happy to see interest in coverage instrumentation!

  1. Let me clarify. The existing "Kirchhoff circuit law" optimization used for 64 bit counters allows us to instrument a subset of basic blocks. IIRC we need to instrument slightly more than 60% of basic blocks to do this. For the minimal block coverage algorithm I've implemented here, we also only need to instrument ~60% of basic blocks. Of course, the blocks we instrument are not the same in both algorithms, but the number of blocks is roughly the same.
  2. I haven't analyzed compilation time, but the theoretical runtime is O(|E| * |V|). For runtime performance, I also haven't measured this but my intuition is that this is very close to optimal since we are adding a single store in as few blocks as possible. In some cases there is some choice in which blocks to instrument and so we could try to make better decisions there, e.g., choose to instrument a rarely executed block rather than one in a tight loop.
  3. My colleague Julian Mestre actually proved that the algorithm produces a correct instrumentation and also that it produces a minimal instrumentation. So we cannot find a coverage that instruments fewer blocks. We are planning on publishing these results in a paper, but that might not be for some time.

I've implemented BlockCoverageInference to assume we are instrumenting Blocks in a CFG, but there is nothing special about blocks. This algorithm will work on any directed graph with entry and exit nodes. I think we can extend this class to support a more general graph and use it for AST coverage. Let me know if you think this could work.

I'm also wondering if we could use this PGO coverage to infer source-based code coverage by looking at debug info.

ellis updated this revision to Diff 430827.May 19 2022, 3:17 PM

Remove isBlockRarelyExecuted() API.
For covered functions, set the entry count to 10000 so that BFI correctly gives non-zero profile counts to only covered blocks.

gulfem added a comment.EditedMay 19 2022, 6:54 PM

Hi @ellis,

This is exciting for us! We are interested in using single byte booleans for source-based code coverage, and I am currently working on the implementation.
We are trying to find a good solution to the problem you described: when we use boolean counters, missing counts cannot be inferred by doing arithmetic on other block counts.
For example, in an if statement the else part is not instrumented, and its counter is inferred by subtracting the then counter from the parent counter.
When we use boolean values for counters, this can no longer be done.
A simpler approach is to instrument more basic blocks, but we are exploring approaches that add counters to a minimal set of blocks.
One of the biggest differences between coverage and PGO is that counters are emitted by traversing AST nodes in the front end for coverage, whereas they are emitted by traversing CFG nodes for PGO.
The algorithm that you introduced is based on CFG traversal, and I'm looking at whether it can be repurposed for coverage.

A few questions:

  1. You mentioned that "we found that only ~60% of basic blocks need to be instrumented". With boolean counters, do you know how much that has changed?
  2. Size reduction is great! Do you have any data on the impact of block coverage on compilation time and runtime performance? For coverage, we are also aiming to improve runtime performance with boolean counters.
  3. Did you do any verification of the correctness of block coverage, e.g., by comparing it against block counts?

It seems like we are trying to solve similar problems in different contexts, so we would be very interested in collaboration.

I'm happy to see interest in coverage instrumentation!

  1. Let me clarify. The existing "Kirchhoff circuit law" optimization used for 64 bit counters allows us to instrument a subset of basic blocks. IIRC we need to instrument slightly more than 60% of basic blocks to do this. For the minimal block coverage algorithm I've implemented here, we also only need to instrument ~60% of basic blocks. Of course, the blocks we instrument are not the same in both algorithms, but the number of blocks is roughly the same.
  2. I haven't analyzed compilation time, but the theoretical runtime is O(|E| * |V|). For runtime performance, I also haven't measured this but my intuition is that this is very close to optimal since we are adding a single store in as few blocks as possible. In some cases there is some choice in which blocks to instrument and so we could try to make better decisions there, e.g., choose to instrument a rarely executed block rather than one in a tight loop.
  3. My colleague Julian Mestre actually proved that the algorithm produces a correct instrumentation and also that it produces a minimal instrumentation. So we cannot find a coverage that instruments fewer blocks. We are planning on publishing these results in a paper, but that might not be for some time.

That's great!

I've implemented BlockCoverageInference to assume we are instrumenting Blocks in a CFG, but there is nothing special about blocks. This algorithm will work on any directed graph with entry and exit nodes. I think we can extend this class to support a more general graph and use it for AST coverage. Let me know if you think this could work.

CoverageMappingGen.cpp adds counters while traversing the AST. For example, this is how counters are added for an if statement.
https://github.com/llvm/llvm-project/blob/main/clang/lib/CodeGen/CoverageMappingGen.cpp#L1383
I'm currently building the prototype for using boolean counters in coverage, and I'm planning to upload a WIP review soon.
After that, we can discuss further whether BlockCoverageInference can be generalized.

I'm also wondering if we could use this PGO coverage to infer source-based code coverage by looking at debug info.

How are you planning to use block coverage for PGO?
In some of the tools where we show source-based coverage, we show whether a line is covered by the executed tests, not how many times each line is executed. For this specific purpose, we want to use single byte counters.
Although I'm not that familiar with the PGO implementation, it seems like numeric counters might be more beneficial for making optimization decisions in PGO. Are there specific optimizations in which you are planning to use block coverage?

ellis added a comment.May 19 2022, 7:33 PM

CoverageMappingGen.cpp adds counters while traversing the AST. For example, this is how counters are added for an if statement.
https://github.com/llvm/llvm-project/blob/main/clang/lib/CodeGen/CoverageMappingGen.cpp#L1383
I'm currently building the prototype for using boolean counters in coverage, and I'm planning to upload a WIP review soon.
After that, we can discuss further whether BlockCoverageInference can be generalized.

I'm also wondering if we could use this PGO coverage to infer source-based code coverage by looking at debug info.

How are you planning to use block coverage for PGO?
In some of the tools where we show source-based coverage, we show whether a line is covered by the executed tests, not how many times each line is executed. For this specific purpose, we want to use single byte counters.
Although I'm not that familiar with the PGO implementation, it seems like numeric counters might be more beneficial for making optimization decisions in PGO. Are there specific optimizations in which you are planning to use block coverage?

The default PGO instruments counters and is useful for optimization. The problem is that this instrumentation is expensive with respect to binary size and runtime. This diff adds a block coverage mode that is much smaller and faster, although the data is not as precise. In the parent diff D125743 I've added support to use block coverage to outline blocks that are not covered, and we may use this data to guide more optimizations in the future.

About the source-based coverage, my intuition is that adding coverage instrumentation to clang AST is less efficient than adding coverage instrumentation to blocks in the LLVM-IR like this diff does. In theory, I think it's possible to map covered LLVM-IR blocks to source code using debug info. Then you would only need to run this block instrumentation and get source coverage out of the profiles. Unfortunately, we don't maintain the binary address or source locations of instrumented blocks in our profiles, so this idea would require some format changes. I'm just throwing out ideas and I'm now realizing https://discourse.llvm.org/ would be a good place for this discussion. Feel free to reach out there if you'd like to talk about it more.

ellis updated this revision to Diff 431047.May 20 2022, 1:49 PM

Refactor BCI a bit and add a more interesting test case in compiler-rt.

spupyrev accepted this revision.Jun 2 2022, 4:48 PM

We reviewed the code internally prior to sending it upstream, and it still looks good to me.

llvm/lib/Transforms/Instrumentation/BlockCoverageInference.cpp
180

In some places I have seen assertions of the form

assert(condition && "readable text message");

Is this optional or the accepted style?

222

Why predecessor? Is this for determinism?

This revision is now accepted and ready to land.Jun 2 2022, 4:48 PM
This revision now requires review to proceed.Jun 2 2022, 4:51 PM
ellis added inline comments.Jun 3 2022, 4:08 PM
llvm/lib/Transforms/Instrumentation/BlockCoverageInference.cpp
180

I don't know if this is optional or not, so I'm open to either way.

I don't expect to ever hit this assert unless there is a bug, so a user should not encounter this assert. If the assert is hit, then we will need to look at and understand the source anyway and I'm not sure how much added value we get from a readable message here.

ellis updated this revision to Diff 434191.Jun 3 2022, 4:09 PM

Improve comments, add more statistics, and remove the get() helper function.

(I plan to read this soon, and adding myself as a blocking reviewer.)

ellis added a comment.Jun 3 2022, 4:11 PM

(I plan to read this soon, and adding myself as a blocking reviewer.)

Ah that makes sense, thanks!

ellis added inline comments.Jun 3 2022, 4:53 PM
llvm/lib/Transforms/Instrumentation/BlockCoverageInference.cpp
222

The predecessor was just an arbitrary choice.

On second thought, this logic is not really needed. The original idea was to have every non-instrumented node depend on either some of its predecessors or some of its successors. This aligns well with the theory, but in practice we don't think this is needed. This may result in a node's coverage depending on more nodes than necessary, but this should be fine. I will remove this logic.

ellis updated this revision to Diff 434207.Jun 3 2022, 4:56 PM
ellis added a comment.Jun 3 2022, 4:56 PM

Remove the logic that forced a non-instrumented node to depend on either its predecessors or its successors, but not both. This logic was valid, but not necessary to produce a valid coverage instrumentation, so it was removed for simplicity.

Also, copy the interesting test case from compiler-rt to llvm. This allows us to more easily debug the complex test case in LLVM rather than compiler-rt.

ellis marked an inline comment as done.Jun 3 2022, 4:56 PM
MaskRay added inline comments.Jun 4 2022, 11:15 AM
llvm/include/llvm/Transforms/Instrumentation/BlockCoverageInference.h
16

LLVM_TRANSFORMS_INSTRUMENTATION_BLOCKCOVERAGEINFERENCE_H

llvm/lib/Transforms/Instrumentation/BlockCoverageInference.cpp
49
51

Use a reference for non-null.

delete assert(F)

58
llvm/lib/Transforms/Instrumentation/PGOInstrumentation.cpp
273

Delete , cl::init(false)

If this new instrumentation mode is for coverage (non-optimization), why modify PGOUseFunc? If it is just to use PGOVerifyBFI, that doesn't justify modifying PGOInstrumentation. It would be cleaner to create a new file, based on my current understanding.

llvm/include/llvm/Transforms/Instrumentation/BlockCoverageInference.h
80
llvm/lib/Transforms/Instrumentation/BlockCoverageInference.cpp
67
104
114

return F.size() == Visited.size()

121
130

BlockList is used only once. Remove using

132

This loop can be augmented with more info to include canFindMinimalCoverage, then just delete canFindMinimalCoverage

138

The time complexity is O(|BB|^2) and can be problematic.

146

llvm::any_of

200

return Path.count(Neighbors[0]) ? Neighbors[1] : Neighbors[0];

245

Use llvm::append_range

llvm/lib/Transforms/Instrumentation/PGOInstrumentation.cpp
1654

This can use a flood-fill instead of a fixed-point iteration.

MaskRay requested changes to this revision.Jun 4 2022, 12:12 PM
This revision now requires changes to proceed.Jun 4 2022, 12:12 PM
MaskRay added inline comments.Jun 4 2022, 12:44 PM
llvm/lib/Transforms/Instrumentation/BlockCoverageInference.cpp
150

I have other things to do and haven't read very carefully, but the impl here appears to differ from the algorithm description in the summary.

Having a SuperReachable Pred does not mean that PredecessorDependencies[&BB] should remain empty.

MaskRay added inline comments.Jun 4 2022, 12:50 PM
llvm/lib/Transforms/Instrumentation/PGOInstrumentation.cpp
273

Also delete cl::ZeroOrMore. I have changed cl::opt so that the option can be specified multiple times without an error.

ellis added a comment.Jun 6 2022, 11:30 AM

If this new instrumentation mode is for coverage (non-optimization), why modify PGOUseFunc? If it is just to use PGOVerifyBFI, that doesn't justify modifying PGOInstrumentation. It would be cleaner to create a new file, based on my current understanding.

Perhaps I should update the description to better explain how this diff is used for optimization. I will update it shortly.

In PGOUseFunc we populate edge weights such that BlockFrequencyInfo reports uncovered blocks as cold and covered blocks as hot. Obviously this will not be as precise as populating real edge weights from block counts, but this is the trade-off for using extremely lightweight instrumented binaries. In the next diff D125743 we can take advantage of these imprecise profiles by using block coverage to guide the machine outliner. We have found this to work quite well at reducing binary size without harming runtime performance.

ellis edited the summary of this revision. (Show Details)Jun 6 2022, 3:28 PM
ellis updated this revision to Diff 435403.Jun 8 2022, 6:58 PM
ellis marked 15 inline comments as done.

Make changes based on comments.

ellis planned changes to this revision.Jun 8 2022, 6:59 PM

I am planning changes for now. We might have a simpler algorithm that uses dominator trees. It also runs in linear time, so it's more efficient than the current one.

llvm/lib/Transforms/Instrumentation/PGOInstrumentation.cpp
273

Nice 🙂

ellis updated this revision to Diff 459908.Sep 13 2022, 3:58 PM

Fix warning with Optional<BlockCoverageInference>

ellis edited the summary of this revision. (Show Details)Sep 13 2022, 4:10 PM
ellis added a reviewer: yozhu.
ellis updated this revision to Diff 459914.Sep 13 2022, 4:13 PM

Add link to paper

jmestre added inline comments.
llvm/lib/Transforms/Instrumentation/BlockCoverageInference.cpp
150

As long as BB itself is reachable from EntryBlock, PredecessorDependencies should be non-empty whenever HasSuperReachablePred is true. Consider a simple path (i.e., one with no repeated nodes) from EntryBlock to BB. Let Pred be the second-to-last node on this path. (Pred is well defined because HasSuperReachablePred being true implies BB != EntryBlock, so the path has at least one edge.) Then Pred belongs to ReachableFromEntry, and so it will be added to PredecessorDependencies.
If it is desirable that the implementation handle functions with blocks that are not reachable from EntryBlock, then we can filter them out at the start and set the coverage status of these blocks to false right away.

ellis updated this revision to Diff 466673.Oct 10 2022, 5:56 PM

Rebase and add comment discussing the runtime.

ellis marked an inline comment as done.Oct 10 2022, 5:59 PM

Can we get another review of this? CC @MaskRay

llvm/lib/Transforms/Instrumentation/BlockCoverageInference.cpp
138

I've added a comment discussing the runtime. We do know of a linear time algorithm that would compute the Dependencies maps, but it is significantly more complicated. We thought it would be best to first add this simpler implementation and in the future we can use the linear algorithm if we find that it gives a significant speedup in practice.

ellis added a comment.Oct 17 2022, 9:54 AM

Friendly ping :)

ellis updated this revision to Diff 475237.Nov 14 2022, 11:59 AM

Rebase.

llvm/lib/Transforms/Instrumentation/PGOInstrumentation.cpp
1654

I opted to use this algorithm for simplicity. Since most CFGs are relatively small, this function doesn't contribute too much to build time. If we discover that this causes a slowdown in some real-world scenario then we can improve it. In fact, we can use info from the block coverage inference algorithm to infer all blocks in one pass. But again, this would add complexity and I would rather not do this unless we needed to.

@MaskRay, were you still planning to review this/do you have any unaddressed concerns?

ellis updated this revision to Diff 490914.Jan 20 2023, 11:11 AM

Rebase, use the std::optional() API, and initialize region counters for cover instructions as well as increment instructions.

smeenai added a comment.EditedFeb 22 2023, 5:37 PM

Ping @MaskRay. Have your blocking concerns been addressed?

This comment was removed by smeenai.

Ping @MaskRay. Have your blocking concerns been addressed?

Very sorry for missing these notifications! I'll read this patch soon...

Ping @MaskRay. Have your blocking concerns been addressed?

Very sorry for missing these notifications! I'll read this patch soon...

Thank you!

In BlockCoverageInference.cpp, the for (auto &BB : F) { ... getReachableAvoiding loop is strictly quadratic, and I think that may be problematic.
There are quite a few programs (e.g. Chisel-generated C++) which contain functions with many basic blocks (say, >1000).
Also see D137184: a workaround for some programs with too many critical edges.

Applying a strictly quadratic algorithm is a compile-time pitfall. I understand that it may be fine for some mobile applications.
There are certainly other quadratic algorithms, such as an O((V+E)*c) algorithm in GCC gcov (V: number of vertices; E: number of edges; c: number of elementary circuits)
and an O(V^2) implementation of the Semi-NCA algorithm.
For them, the quadratic time complexity is a theoretical upper bound that is very difficult to reach in practice (if ever, considering that in practice most CFGs are reducible).

Note that optimality in the number of instrumented basic blocks is not required.
It does not necessarily lead to optimal performance, since we have discarded execution count information.
Is it possible to use a faster algorithm that may instrument a few more basic blocks (let's arbitrarily say a 1.2-approximation is good enough in practice)?

llvm/lib/Transforms/Instrumentation/BlockCoverageInference.cpp
112

Rename ExitBlocks: BB may end with an unreachable/resume/cleanupret/etc. I think we don't name them exit blocks.

138

We have quadratic time algorithms like O((V+E)(c+1)) used in gcov.

llvm/lib/Transforms/Instrumentation/PGOInstrumentation.cpp
1654

I think it's worth fixing. A flood-fill algorithm is nearly the same length and avoids the performance pitfall. That style is used much more often than the fixed-point iteration here.

1716
ellis updated this revision to Diff 501634.Mar 1 2023, 12:37 PM
ellis marked an inline comment as done.

Use flood-fill algorithm to infer block coverage and rename some compiler-rt tests.

llvm/lib/Transforms/Instrumentation/BlockCoverageInference.cpp
112

Is TerminalBlocks a better name?

llvm/lib/Transforms/Instrumentation/PGOInstrumentation.cpp
1654

I've changed this to a flood-fill algorithm, but I first needed to compute InverseDependencies, which is done in one pass.
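A hedged sketch of what such a worklist flood-fill can look like in plain C++ (the types and the InverseDependencies shape are assumptions, not the PGOInstrumentation.cpp code):

  #include <map>
  #include <set>
  #include <vector>

  using Block = int;

  // Starting from the instrumented blocks that were hit, repeatedly mark every
  // block whose coverage is implied by an already-covered block.
  void floodFillCoverage(
      std::set<Block> &Covered,
      const std::map<Block, std::vector<Block>> &InverseDependencies) {
    std::vector<Block> Worklist(Covered.begin(), Covered.end());
    while (!Worklist.empty()) {
      Block B = Worklist.back();
      Worklist.pop_back();
      auto It = InverseDependencies.find(B);
      if (It == InverseDependencies.end())
        continue;
      for (Block Dependent : It->second)
        if (Covered.insert(Dependent).second) // newly inferred as covered
          Worklist.push_back(Dependent);
    }
  }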

ellis added inline comments.Mar 1 2023, 12:43 PM
llvm/lib/Transforms/Instrumentation/BlockCoverageInference.cpp
138

I haven't forgotten about this, I just wanted to resolve the other comments first.

ellis added a comment.Mar 14 2023, 5:53 PM

In BlockCoverageInference.cpp, the for (auto &BB : F) { ... getReachableAvoiding loop is strictly quadratic, and I think that may be problematic.
There are quite a few programs (e.g. Chisel-generated C++) which contain functions with many basic blocks (say, >1000).
Also see D137184: a workaround for some programs with too many critical edges.

Applying a strictly quadratic algorithm is a compile-time pitfall. I understand that it may be fine for some mobile applications.
There are certainly other quadratic algorithms, such as an O((V+E)*c) algorithm in GCC gcov (V: number of vertices; E: number of edges; c: number of elementary circuits)
and an O(V^2) implementation of the Semi-NCA algorithm.
For them, the quadratic time complexity is a theoretical upper bound that is very difficult to reach in practice (if ever, considering that in practice most CFGs are reducible).

Note that optimality in the number of instrumented basic blocks is not required.
It does not necessarily lead to optimal performance, since we have discarded execution count information.
Is it possible to use a faster algorithm that may instrument a few more basic blocks (let's arbitrarily say a 1.2-approximation is good enough in practice)?

We have analyzed the runtime on several real-world programs that have large functions with many basic blocks. We found that the total runtime of the pgo-instr-gen pass for functions with less than 1.5K blocks is less than 5 seconds. For functions with between 4K and 10K basic blocks, the runtime was 150 seconds to 850 seconds.

Note that the intended use case is to instrument minimal block coverage on binaries interested in minimizing binary size so that we can better control the machine outliner. Those binaries likely won't have functions this large because inlining is likely restricted to reduce code size. For those special cases where functions are large (say, >1.5K blocks), I'd like to bail on this algorithm and instead instrument every block. This will keep the code complexity down while preventing build time regressions for the pathological cases.
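A minimal sketch of that bail-out policy (the threshold constant and function name below are assumptions for illustration):

  // Fall back to instrumenting every basic block when the function is too
  // large for the quadratic inference to be worthwhile.
  constexpr unsigned MaxBlocksForCoverageInference = 1500; // assumed cutoff

  bool useCoverageInference(unsigned NumBasicBlocks) {
    return NumBasicBlocks <= MaxBlocksForCoverageInference;
  }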

ellis updated this revision to Diff 505341.Mar 14 2023, 5:54 PM

Bail if the function has >1.5K blocks.

ellis added a comment.Mar 22 2023, 5:13 PM

Hi @MaskRay do you have any more concerns?

kyulee accepted this revision.Mar 29 2023, 7:53 AM

This looks good to me, as this change is effective even for large programs.
This change has been sitting for a while, and I think it would be beneficial for the community to have it, so I approve it.
I also suggest following up on any concerns and feedback after it lands.

Just to clarify, as far as I can tell, all other comments have been addressed, and the one remaining concern is around using a faster algorithm. I believe that concern is mitigated by applying the size cutoff. You could argue that having the linear algorithm is cleaner than having a quadratic one with a cutoff, but it sounds like the linear algorithm would be significantly more complex, which is also a trade-off. For the intended application, where being optimal for size is really important and you're unlikely to have a function large enough to hit the cutoff in practice, going with the simpler algorithm seems like a reasonable trade-off.

This revision was not accepted when it landed; it landed in state Needs Review.Mar 29 2023, 4:24 PM
This revision was landed with ongoing or failed builds.
This revision was automatically updated to reflect the committed changes.