This is an archive of the discontinued LLVM Phabricator instance.

[SimplifyCFG] accumulate bonus insts cost
ClosedPublic

Authored by yaxunl on Aug 22 2022, 12:59 PM.

Details

Summary

SimplifyCFG folds

bool foo() {
if (cond1) return false;
if (cond2) return false;
return true;
}

as

bool foo() {
  if (cond1 | cond2) return false;
  return true;
}

Instructions like 'cond2' are called 'bonus insts' in branch folding because they introduce
overhead: the original CFG could exit early, whereas the folded CFG always executes
them. SimplifyCFG calculates the cost of the 'bonus insts' incurred by folding a BB into
a predecessor BB that shares its destination. If the cost is below bonus-inst-threshold,
SimplifyCFG will fold that BB into its predecessor, and cond2 will then always be executed.

When SimplifyCFG calculates the cost of 'bonus insts', it only considers the 'bonus insts'
of the current BB being folded. This causes issues for unrolled loops
whose blocks share a destination, e.g.

bool foo(int *a) {
  for (int i = 0; i < 32; i++)
    if (a[i] > 0) return false;
  return true;
}

After unrolling, it becomes

bool foo(int *a) {
  if (a[0] > 0) return false;
  if (a[1] > 0) return false;
  // ...
  if (a[31] > 0) return false;
  return true;
}

SimplifyCFG will merge each BB with its predecessor BB
and end up with 32 'bonus insts' that are always executed, which
is much slower than the original CFG.

The root cause is that SimplifyCFG does not consider the
accumulated cost of 'bonus insts' which are folded from
different BB's.

This patch fixes that by introducing a DenseMap to track
costs of 'bonus insts' coming from different BB's into
the same BB, and cuts off if the accumulated cost
exceeds a threshold.

Diff Detail

Event Timeline

yaxunl created this revision.Aug 22 2022, 12:59 PM
Herald added a project: Restricted Project. · View Herald TranscriptAug 22 2022, 12:59 PM
yaxunl requested review of this revision.Aug 22 2022, 12:59 PM
Herald added a project: Restricted Project. · View Herald TranscriptAug 22 2022, 12:59 PM
Herald added a subscriber: wdng. · View Herald Transcript
yaxunl retitled this revision from [SimplifyCFG] accumuate bonus insts cost to [SimplifyCFG] accumulate bonus insts cost.Aug 22 2022, 1:03 PM
yaxunl edited the summary of this revision. (Show Details)
tra added a comment.Aug 22 2022, 2:12 PM
bool foo() {
  if (cond1 || cond2) return false
  return true;
}

Nit. I'd use arithmetic |. || would imply that short-circuiting would make both examples *exactly* the same -- cond2 would not be evaluated if cond1 is true.
Or just use pseudo-code that is distinctly not C/C++ to illustrate proposed transformation.

llvm/lib/Transforms/Utils/SimplifyCFG.cpp
263–264

Do we ever need to clear the map carried by SimplifyCFGOpt?
Unlike the maps instantiated as a local variable in other places, the lifetime of SimplifyCFGOpt is less certain.

If we do need to clear them, should we be doing that to the map given to us by the user, or only to the local one?

yaxunl edited the summary of this revision. (Show Details)Aug 22 2022, 2:16 PM
yaxunl marked an inline comment as done.Aug 22 2022, 2:46 PM
yaxunl added inline comments.
llvm/lib/Transforms/Utils/SimplifyCFG.cpp
263–264

Do we ever need to clear the map carried by SimplifyCFGOpt?
Unlike the maps instantiated as a local variable in other places, the lifetime of SimplifyCFGOpt is less certain.

If we do need to clear them, should we be doing that to the map given to us by the user, or only to the local one?

SimplifyCFGOpt is used by higher-level passes which work on functions or loops. The pass which uses SimplifyCFGOpt repeatedly is responsible for creating NumBonusInsts and passing it to SimplifyCFGOpt. NumBonusInsts is valid for the life cycle of the higher-level pass. It is only updated when new BB's are folded. Once folded, the folded BB's are gone and will not be double counted in the future. Therefore NumBonusInsts need not be emptied.

The LocalNumBonusInsts is always empty when SimplifyCFGOpt. It is used for situations where SimplifyCFG is used to simplify CFG's which do not fold branches or are not concerned with accumulated bonus insts. It degenerates to the original behaviour, i.e. only considering the bonus insts of the current BB.

tra added inline comments.Aug 22 2022, 3:03 PM
llvm/lib/Transforms/Utils/SimplifyCFG.cpp
263–264

Once folded, the folded BB's are gone and will not be double counted in the future. Therefore NumBonusInsts need not be emptied.

OK.

Further question -- is there a possibility that a BB we've recorded in the map gets deleted (e.g. after merging into another BB), and then another BB gets created, reusing the same pointer? That would result in erroneously counting in the old instance's bonus instructions.

It may not be likely to happen and the effect would be that SimplifyCfg would hit the bonus threshold sooner than it should have normally, so it's probably not fatal.
On the other hand, it may make compilation nondeterministic -- if/when we hit such a pointer reuse we'll end up generating different code and that's more of a problem.

The LocalNumBonusInsts is always empty when SimplifyCFGOpt.

Looks like something got lost during editing. I assume you meant "when SimplifyCFGOpt is constructed with user-supplied map".

yaxunl marked an inline comment as done.Aug 24 2022, 9:28 AM
yaxunl added inline comments.
llvm/lib/Transforms/Utils/SimplifyCFG.cpp
263–264

Further question -- is there a possibility that a BB we've recorded in the map gets deleted (e.g. after merging into another BB), and then another BB gets created, reusing the same pointer? That would result in erroneously counting in the old instance's bonus instructions.

It may not be likely to happen and the effect would be that SimplifyCfg would hit the bonus threshold sooner than it should have normally, so it's probably not fatal.
On the other hand, it may make compilation nondeterministic -- if/when we hit such a pointer reuse we'll end up generating different code and that's more of a problem.

I can use CallbackVH, which will automatically update the map when the BB is deleted or replaced.

yaxunl added inline comments.Aug 25 2022, 2:32 PM
llvm/lib/Transforms/Utils/SimplifyCFG.cpp
263–264

Further question -- is there a possibility that a BB we've recorded in the map gets deleted (e.g. after merging into another BB), and then another BB gets created, reusing the same pointer? That would result in erroneously counting in the old instance's bonus instructions.

It may not be likely to happen and the effect would be that SimplifyCfg would hit the bonus threshold sooner than it should have normally, so it's probably not fatal.
On the other hand, it may make compilation nondeterministic -- if/when we hit such a pointer reuse we'll end up generating different code and that's more of a problem.

I can use CallbackVH, which will automatically update the map when the BB is deleted or replaced.

Actually, I will use ValueMap, which uses CallbackVH to remove the BB from the map when it is deleted. It is simpler.

yaxunl updated this revision to Diff 455717.Aug 25 2022, 2:35 PM

use ValueMap to track BasicBlock

yaxunl updated this revision to Diff 455899.Aug 26 2022, 7:04 AM

update tests

tra added a comment.Aug 26 2022, 10:52 AM

Nice. I like the consolidated CostTracker.

llvm/lib/Transforms/Utils/SimplifyCFG.cpp
7283

This needs to be renamed, too.

yaxunl marked 2 inline comments as done.Aug 29 2022, 8:39 AM
yaxunl added inline comments.
llvm/lib/Transforms/Utils/SimplifyCFG.cpp
7283

will do

yaxunl updated this revision to Diff 456356.Aug 29 2022, 8:39 AM
yaxunl marked an inline comment as done.

fix variable names

tra accepted this revision.Aug 29 2022, 10:53 AM
tra added inline comments.
llvm/test/Transforms/SimplifyCFG/branch-fold-unrolled-loop.ll
8 ↗(On Diff #456356)

It's not quite clear what's the purpose of this test. I.e. what are the important pieces of the checks below?

I'd add a comment describing expected transformation.

This revision is now accepted and ready to land.Aug 29 2022, 10:53 AM
fhahn added a subscriber: fhahn.Aug 29 2022, 10:58 AM
fhahn added inline comments.
llvm/test/Transforms/SimplifyCFG/branch-fold-unrolled-loop.ll
3 ↗(On Diff #456356)

Does the test depend on the target triple? If not, it should be removed. If it does, it needs REQUIRES: to only run when the AMDGPU backend is built.

Also, can you add the test in a separate patch and only include the changes caused by the patch in the diff so it is easier to see what's going on?

It also seems like the test coverage is quite limited. Are there some edge cases that should be covered by extra tests?

nikic added a subscriber: nikic.

Do you have any numbers for the impact of this change? I suspect (but haven't checked) that it may be quite significant, because the current tiny bonus inst threshold (1) is tuned for the current implementation, and this patch will result in much less common dest folding than would be desirable.

llvm/lib/Transforms/Utils/SimplifyCFG.cpp
217

insert returns the iterator, no need to call find again.

225

You can use lookup() here, which will return 0 if not found.

yaxunl marked 4 inline comments as done.Aug 29 2022, 7:45 PM

Do you have any numbers for the impact of this change? I suspect (but haven't checked) that it may be quite significant, because the current tiny bonus inst threshold (1) is tuned for the current implementation, and this patch will result in much less common dest folding than would be desirable.

For a synthetic benchmark geared towards loops with small workloads, we saw around 16% performance gain due to avoiding always executing 31 extra load instructions by not folding 31 basic blocks coming from an unrolled loop.

For typical HIP programs, I did not see significant performance differences before and after this change. This is because the original bonus instruction threshold 2 is very small: it only allows folding small basic blocks which contain just one instruction other than the cmp/branch instructions. In practical applications, such basic blocks are rare and usually are not performance bottlenecks. Only in special situations (e.g. when a fully unrolled loop with a large trip count ends up with a large number of foldable basic blocks) does the performance penalty become significant.

llvm/lib/Transforms/Utils/SimplifyCFG.cpp
217

insert returns the iterator, no need to call find again.

will do

225

You can use lookup() here, which will return 0 if not found.

will do

llvm/test/Transforms/SimplifyCFG/branch-fold-unrolled-loop.ll
3 ↗(On Diff #456356)

Does the test depend on the target triple? If not, it should be removed. If it does, it needs REQUIRES: to only run when the AMDGPU backend is built.

The test depends on the target triple. Will add REQUIRES.

Also, can you add the test in a separate patch and only include the changes caused by the patch in the diff so it is easier to see what's going on?

Will add this test in a separate patch and include the change only.

The difference is that before this patch, 3 BB's are folded into the first BB, accumulating 3 bonus insts; after this patch, only 2 BB's are folded, each adding 1 bonus inst.

It also seems like the test coverage is quite limited. Are there some edge cases that should be covered by extra tests?

Will add more tests for edge cases.

8 ↗(On Diff #456356)

It's not quite clear what's the purpose of this test. I.e. what are the important pieces of the checks below?

I'd add a comment describing expected transformation.

will add a comment about expected transformations.

yaxunl updated this revision to Diff 456532.Aug 29 2022, 8:27 PM
yaxunl marked 4 inline comments as done.

revised per Florian's, Nikita's, and Artem's comments.

yaxunl added a comment.Sep 6 2022, 8:34 AM

ping. Any further concerns? Thanks.

fhahn added a comment.Sep 6 2022, 8:36 AM

Do you have any numbers for the impact of this change? I suspect (but haven't checked) that it may be quite significant, because the current tiny bonus inst threshold (1) is tuned for the current implementation, and this patch will result in much less common dest folding than would be desirable.

For a synthetic benchmark geared towards loops with small workloads, we saw around 16% performance gain due to avoiding always executing 31 extra load instructions by not folding 31 basic blocks coming from an unrolled loop.

For typical HIP programs, I did not see significant performance differences before and after this change. This is because the original bonus instruction threshold 2 is very small. It only allows folding small basic blocks which only contain one instruction other than the cmp/branch instructions. In practical applications, such basic blocks are rare and usually are not performance bottlenecks. Only in special situations (e.g. a fully unrolled loop of large loop counts ends up with a large number of foldable basic blocks) the performance penalty becomes significant.

I am not sure what typical HIP programs mean here. Did you measure the impact on general CPU benchmarks?

yaxunl added a comment.Sep 6 2022, 8:58 AM

Do you have any numbers for the impact of this change? I suspect (but haven't checked) that it may be quite significant, because the current tiny bonus inst threshold (1) is tuned for the current implementation, and this patch will result in much less common dest folding than would be desirable.

For a synthetic benchmark geared towards loops with small workloads, we saw around 16% performance gain due to avoiding always executing 31 extra load instructions by not folding 31 basic blocks coming from an unrolled loop.

For typical HIP programs, I did not see significant performance differences before and after this change. This is because the original bonus instruction threshold 2 is very small. It only allows folding small basic blocks which only contain one instruction other than the cmp/branch instructions. In practical applications, such basic blocks are rare and usually are not performance bottlenecks. Only in special situations (e.g. a fully unrolled loop of large loop counts ends up with a large number of foldable basic blocks) the performance penalty becomes significant.

I am not sure what typical HIP programs mean here. Did you measure the impact on general CPU benchmarks?

No. Can you recommend a free CPU benchmark that is suitable for this purpose? Thanks.

yaxunl added a comment.Sep 6 2022, 2:57 PM

Do you have any numbers for the impact of this change? I suspect (but haven't checked) that it may be quite significant, because the current tiny bonus inst threshold (1) is tuned for the current implementation, and this patch will result in much less common dest folding than would be desirable.

For a synthetic benchmark geared towards loops with small workloads, we saw around 16% performance gain due to avoiding always executing 31 extra load instructions by not folding 31 basic blocks coming from an unrolled loop.

For typical HIP programs, I did not see significant performance differences before and after this change. This is because the original bonus instruction threshold 2 is very small. It only allows folding small basic blocks which only contain one instruction other than the cmp/branch instructions. In practical applications, such basic blocks are rare and usually are not performance bottlenecks. Only in special situations (e.g. a fully unrolled loop of large loop counts ends up with a large number of foldable basic blocks) the performance penalty becomes significant.

I am not sure what typical HIP programs mean here. Did you measure the impact on general CPU benchmarks?

No. Can you recommend a free CPU benchmark that is suitable for this purpose? Thanks.

I will try running the SPEC CPU benchmark. There are many sub-tests. Do you recommend any sub-tests that are most relevant to this patch? Thanks.

fhahn added a comment.Sep 7 2022, 12:38 AM

Do you have any numbers for the impact of this change? I suspect (but haven't checked) that it may be quite significant, because the current tiny bonus inst threshold (1) is tuned for the current implementation, and this patch will result in much less common dest folding than would be desirable.

For a synthetic benchmark geared towards loops with small workloads, we saw around 16% performance gain due to avoiding always executing 31 extra load instructions by not folding 31 basic blocks coming from an unrolled loop.

For typical HIP programs, I did not see significant performance differences before and after this change. This is because the original bonus instruction threshold 2 is very small. It only allows folding small basic blocks which only contain one instruction other than the cmp/branch instructions. In practical applications, such basic blocks are rare and usually are not performance bottlenecks. Only in special situations (e.g. a fully unrolled loop of large loop counts ends up with a large number of foldable basic blocks) the performance penalty becomes significant.

I am not sure what typical HIP programs mean here. Did you measure the impact on general CPU benchmarks?

No. Can you recommend a free CPU benchmark that is suitable for this purpose? Thanks.

I will try running the SPEC CPU benchmark. There are many sub-tests. Do you recommend any sub-tests that are most relevant to this patch? Thanks.

For SPEC, I usually run SPEC2017rate (both integer and fp subsets).

ronlieb added a subscriber: ronlieb.Sep 7 2022, 6:42 AM

Do you have any numbers for the impact of this change? I suspect (but haven't checked) that it may be quite significant, because the current tiny bonus inst threshold (1) is tuned for the current implementation, and this patch will result in much less common dest folding than would be desirable.

For a synthetic benchmark geared towards loops with small workloads, we saw around 16% performance gain due to avoiding always executing 31 extra load instructions by not folding 31 basic blocks coming from an unrolled loop.

For typical HIP programs, I did not see significant performance differences before and after this change. This is because the original bonus instruction threshold 2 is very small. It only allows folding small basic blocks which only contain one instruction other than the cmp/branch instructions. In practical applications, such basic blocks are rare and usually are not performance bottlenecks. Only in special situations (e.g. a fully unrolled loop of large loop counts ends up with a large number of foldable basic blocks) the performance penalty becomes significant.

I am not sure what typical HIP programs mean here. Did you measure the impact on general CPU benchmarks?

No. Can you recommend a free CPU benchmark that is suitable for this purpose? Thanks.

I will try running the SPEC CPU benchmark. There are many sub-tests. Do you recommend any sub-tests that are most relevant to this patch? Thanks.

For SPEC, I usually run SPEC2017rate (both integer and fp subsets).

Helping our author here: overnight, I ran CPU 2017 int/fp speed rather than rate, with the patch and without. Would that suffice?
Sharing a summary of the results; if you really want rate, I can run rate.
The speed fp was 0.34% faster on GeoMean, and speed int was 0.21% slower.
605.mcf and 631.deepsjeng were above the 1.75% threshold:
605.mcf -3.41%
631.deepsjeng 2.32%

yaxunl added a comment.Sep 8 2022, 1:53 PM

Helping our author here: overnight, I ran CPU 2017 int/fp speed rather than rate, with the patch and without. Would that suffice?
Sharing a summary of the results; if you really want rate, I can run rate.
The speed fp was 0.34% faster on GeoMean, and speed int was 0.21% slower.
605.mcf and 631.deepsjeng were above the 1.75% threshold:
605.mcf -3.41%
631.deepsjeng 2.32%

The following are the detailed results:

fp speed          before   after    delta
603.bwaves_s      525.95   527.75    0.34%
607.cactuBSSN_s   242.86   246.16    1.36%
619.lbm_s          67.54    67.55    0.01%
621.wrf_s          88.79    89.74    1.07%
627.cam4_s        124.71   123.57   -0.91%
628.pop2_s         52.73    53.06    0.63%
638.imagick_s     238.67   240.62    0.82%
644.nab_s         370.06   367.73   -0.63%
649.fotonik3d_s    82.96    83.76    0.96%
654.roms_s        268.70   268.22   -0.18%
geomean           158.36   158.90    0.34%

int speed         before   after    delta    rerun 3 before  rerun 3 after  rerun 3 delta
600.perlbench_s     4.34     4.32   -0.46%
602.gcc_s           8.58     8.33   -2.91%    8.51            8.51            0.00%
605.mcf_s           6.73     6.50   -3.42%    6.74            6.51           -3.41%
620.omnetpp_s       3.53     3.46   -1.98%    3.54            3.51           -0.85%
623.xalancbmk_s     8.24     8.22   -0.24%
625.x264_s          9.97     9.96   -0.10%
631.deepsjeng_s     3.06     3.26    6.54%    3.02            3.09            2.32%
641.leela_s         3.79     3.83    1.06%
648.exchange2_s     7.37     7.43    0.81%
657.xz_s           19.65    19.45   -1.02%
geomean             6.44     6.42   -0.21%

@fhahn Is this acceptable? Thanks.

arsenm added inline comments.Sep 9 2022, 6:50 AM
llvm/include/llvm/Transforms/Utils/Local.h
210

Shouldn't need llvm::

yaxunl marked an inline comment as done.Sep 12 2022, 10:28 AM
yaxunl added inline comments.
llvm/include/llvm/Transforms/Utils/Local.h
210

will remove

yaxunl updated this revision to Diff 459509.Sep 12 2022, 10:29 AM
yaxunl marked an inline comment as done.

remove redundant llvm::

ping. Any further concerns? Thanks.

yaxunl updated this revision to Diff 460440.Sep 15 2022, 9:22 AM

revise the test to be target independent

This revision was landed with ongoing or failed builds.Sep 18 2022, 5:24 PM
This revision was automatically updated to reflect the committed changes.
nikic added a comment.Sep 19 2022, 5:52 AM

I've reverted this change due to large compile-time regressions, see http://llvm-compile-time-tracker.com/compare.php?from=930315f6aa587ac962183708844eb2390d5ba55e&to=e5581df60a35fffb0c69589777e4e126c849405f&stat=instructions. I don't immediately see why it would be this expensive.

I've reverted this change due to large compile-time regressions, see http://llvm-compile-time-tracker.com/compare.php?from=930315f6aa587ac962183708844eb2390d5ba55e&to=e5581df60a35fffb0c69589777e4e126c849405f&stat=instructions. I don't immediately see why it would be this expensive.

Thanks. I will take a look.

This revision is now accepted and ready to land.Oct 24 2022, 2:14 PM

Since there are different opinions about this approach, I reverted this patch and opened an RFC for further discussion:

https://discourse.llvm.org/t/simplifycfg-track-branch-fold-costs-across-transformations/66191

Matt added a subscriber: Matt.Oct 25 2022, 11:28 AM