This is an archive of the discontinued LLVM Phabricator instance.

[CodeGenPrep] Skip merging empty case blocks
ClosedPublic

Authored by junbuml on Jul 22 2016, 1:45 PM.

Details

Reviewers

t.p.northover
wmi
davidxl
joerg
manmanren
mcrosier

Commits

rG90b6b5074af4: [CodeGenPrep] Skip merging empty case blocks
rG85347dde27db: [CodeGenPrep] Skip merging empty case blocks
rG82f55c54460a: [CodeGenPrep] Skip merging empty case blocks
rL289988: [CodeGenPrep] Skip merging empty case blocks
rL289951: [CodeGenPrep] Skip merging empty case blocks
rL287553: [CodeGenPrep] Skip merging empty case blocks

Summary

Merging an empty case block into the header block of a switch can cause ISel to add COPY instructions in the switch header, instead of in the case block, if the case block is used as an incoming block of a PHI. This can increase the dynamic instruction count, especially when the switch is in a loop. I added a test case which was reduced from the benchmark I was targeting.
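For readers unfamiliar with the pattern, the following is a minimal source-level sketch of a switch in a loop whose cases only select a value that is merged after the switch. It is not the benchmark or the test case from this review; the function name `sumCodes` and the values are hypothetical, and whether a given case really becomes an empty block depends on earlier passes.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical example: a switch inside a loop where some cases do no work
// of their own and only choose the value that flows into the join point.
int sumCodes(const int *Ops, std::size_t N) {
  assert(Ops != nullptr || N == 0);
  int Acc = 0;
  for (std::size_t I = 0; I < N; ++I) {
    int Delta;
    switch (Ops[I]) {
    case 0:
      Delta = 1; // may become an empty block that only feeds the PHI
      break;
    case 1:
      Delta = 10; // same shape: constant selected, no other work
      break;
    case 2:
      Delta = Acc * 2 + 3; // a case block with real computation
      break;
    default:
      Delta = 0;
      break;
    }
    Acc += Delta; // 'Delta' corresponds to a PHI at the join block
  }
  return Acc;
}
```

If the blocks for cases 0 and 1 are merged into the switch header, the copies that materialize `Delta` for those cases would have to live in the header, which runs on every iteration.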

Diff Detail

Event Timeline

junbuml updated this revision to Diff 65136.Jul 22 2016, 1:45 PM

junbuml retitled this revision from to [CodeGenPrep] Skip merging empty case blocks.

junbuml updated this object.

junbuml added reviewers: rengolin, t.p.northover, mcrosier.

junbuml added a subscriber: llvm-commits.

Herald added a subscriber: mcrosier. Jul 22 2016, 1:45 PM

junbuml updated this object.Jul 22 2016, 1:49 PM

The change kind of makes sense to me, but I'd rather have others with more experience in basic block merging opine first.

I've added a couple of points inline. It would be good to see this behaviour in other targets, though.

It'd also be good to do a run of the test-suite to make sure there are no regressions (maybe code size?) in at least one target.

cheers,
--renato

lib/CodeGen/CodeGenPrepare.cpp
550	Why not just continue here?
test/CodeGen/AArch64/aarch64-skip-merging-case-block.ll
8 ↗	(On Diff #65136)	Please put the check lines where the related sources are, down there with the IR. It's not clear how these lines test anything you implemented above.
22 ↗	(On Diff #65136)	This is a very convoluted test. Can't you make it simpler? From the description in the comments, you just need a switch in a loop, and you don't need any of those structs, or extern functions or anything.
164 ↗	(On Diff #65136)	Can you simplify your code to not need those functions? The change comment implies you can...
172 ↗	(On Diff #65136)	Do you really need the attributes to make the test exhibit the behaviour you're trying to fix?

Sorry for the long turnaround on this. First, I made the test case much simpler, as Renato suggested. I also made this apply conservatively, because skipping the merge could add a branch instruction in the empty case block. So this will happen only when the frequency of the empty case block is significantly lower than the frequency of the switch header.
Please take a look and let me know if you have any comments.
Thanks,
Jun

lib/CodeGen/CodeGenPrepare.cpp
550	My intention was to perform continue for the outer for loop, not for the inner while loop.

junbuml added a reviewer: davidxl.Aug 19 2016, 7:20 AM

davidxl added inline comments.Aug 19 2016, 3:30 PM

lib/CodeGen/CodeGenPrepare.cpp
537	If the unique Predecessor of BB is terminated ..
540	Do you have actual examples showing the problem of extra copy instruction added? I could not connect the dots here.
557	How about the size impact (for Os build) ?

Address comments from David.

junbuml marked an inline comment as done.Aug 22 2016, 11:25 AM

junbuml added inline comments.

lib/CodeGen/CodeGenPrepare.cpp
537	Thanks !
540	In my first posting (Diff 1), I added a test case (aarch64-skip-merging-case-block.ll) which was reduced from the benchmark I was targeting. That test case shows copies in the header of a switch that is in a loop and reproduces the situation in CGP. I removed the test because it was unnecessarily complex to be used as a test case. Please see aarch64-skip-merging-case-block.ll in Diff 1 and let me know if you want me to add the test to this patch.
557	This could potentially add a branch instruction, so we should do this only when OptSize is false.

Fixing this in CGP seems like papering over real problems in ISEL or machine function optimization passes. The reason is that such code may be written by the user (not created by CGP): e.g.

switch( ..) {
   case 1:             // no empty block created for this
     break;
   case ..
         ...
  }

Probably, a smarter ISel could handle this issue. I'm not sure if GlobalISel could cover such a case when handling phi nodes. Considering that the original purpose of CGP is to better prepare IR for local ISel, I think handling this in CGP is not a bad idea while we rely on the local ISel. Do you have any machine function pass in mind where we can deal with this issue?

davidxl edited edge metadata.Aug 26 2016, 4:14 PM

davidxl added a subscriber: hfinkel.

Kindly ping. Please let me know any comment on it.

Kindly ping. Is there any comment or suggestion?

We actually see similar issues introduced by SimplifyCFG, which introduces new critical edges that later cannot be properly split in PhiElimination, which results in copy instructions being inserted into blocks with higher frequency. Until we have a better solution, this patch seems like a reasonable workaround.

Hal, what is your opinion?

davidxl added a subscriber: danielcdh.Sep 20 2016, 12:30 PM

Could you check if https://reviews.llvm.org/D24818 would fix your issue? I suppose when there are multiple switch targets, each edge is likely to have a <50% taken ratio (even with a static profile). Thus I guess all critical edges should be split by machine sink. Or, if not, you can simply change 50% to 99% and see if it fixes the problem.

In D22696#547869, @davidxl wrote:

We actually see similar issues introduced by SimplifyCFG, which introduces new critical edges that later cannot be properly split in PhiElimination, which results in copy instructions being inserted into blocks with higher frequency. Until we have a better solution, this patch seems like a reasonable workaround.

Hal, what is your opinion?

I've thought about this only for a few minutes, and I'd definitely like to know if @danielcdh's patch also fixes this, but what is the benefit of creating critical edges here in the first place (i.e. when the case block is likely to be executed)? Shouldn't we decide whether to speculate instructions (including copies that come from PHIs) later at the MI level (based on instruction costs and profiling data) as an independent set of decisions?

I tried dehao's patch on the original test case -- unfortunately it does not work. The reason is that MachineSink pass fails to split the critical edge because the indirect branch (from switch lowering) in the source BB prevents the edge from being split because it is not analyzable. See MachineBasicBlock::canSplitCriticalEdge

I've thought about this only for a few minutes, and I'd definitely like to know if @danielcdh's patch also fixes this, but what is the benefit of creating critical edges here in the first place (i.e. when the case block is likely to be executed)? Shouldn't we decide whether to speculate instructions (including copies that come from PHIs) later at the MI level (based on instruction costs and profiling data) as an independent set of decisions?

By skipping merging an empty case block, we could add an extra branch, so I do this only when the frequency of empty case block is significantly low using BFI. Based on the original purpose of CGP to better prepare IR for local ISel, doing this in CGP doesn't seem to be a really bad idea for me. However, handling this independently in MI level could be more reasonable.

I tried dehao's patch on the original test case -- unfortunately it does not work. The reason is that MachineSink pass fails to split the critical edge because the indirect branch (from switch lowering) in the source BB prevents the edge from being split because it is not analyzable. See MachineBasicBlock::canSplitCriticalEdge

Thank you very much David for the test. Yes, I also confirmed that @danielcdh's patch didn't fix the case I was targeting. Let me take a closer look at the MachineSink pass to see if I can come up with a feasible solution there.

Could you point me to the original test case? I tried the unit test in this patch, and my patch can create the critical edge. And it looks to me that, by the time of MachineSink, the switch has already been lowered to branches, so there is no indirect branch?

Given the limitation of MachineSink pass to split critical edges later (for the switch case), the effect of creating critical edge in CGP here can be quite detrimental, so we should probably make the patch more general -- instead of checking just switch inst in Predecessor, check indirect branch as well. Perhaps also adding some cost related heuristic -- by checking the number of phis (with incoming value from Pred) in the Succ block.

Here is what happens in this particular case:

       BB#7                   <--- indirect branch with jump table
     /      \       \
  BB4      BB5      EB
     \      |      /
           BB2

EB gets eliminated by CGP:

       BB#7                   <--- indirect branch with jump table
     /      \       \
  BB4      BB5       |
     \      |      /
           BB2

BB#2:

%vreg8<def> = PHI %vreg33, <BB#7>, %vreg19, <BB#10>, ....
%vreg9<def> = PHI %vreg34, <BB#7>, %vreg48, <BB#10>, ....
%vreg10<def> = PHI %vreg35, <BB#7>, %vreg2, <BB#10>, ...
%vreg11<def> = PHI %vreg36, <BB#7>, %vreg1, <BB#10>,  ....

BB#7:

%vreg36<def> = COPY %vreg38; GPR64all:%vreg36 GPR64:%vreg38
 %vreg35<def> = COPY %vreg39; GPR64all:%vreg35 GPR64:%vreg39
 %vreg33<def> = COPY %vreg40; GPR32all:%vreg33 GPR32:%vreg40
 %vreg34<def> = SUBREG_TO_REG 0, %vreg41, 15; GPR64all:%vreg34 GPR32:%vreg41
 %vreg43<def> = SUBREG_TO_REG 0, %vreg44<kill>, 15; GPR64:%vreg43 GPR32common:%vreg44
 %vreg47<def> = LDRXroX %vreg46, %vreg43<kill>, 0, 1; mem:LD8[JumpTable] GPR64:%vreg47,%vreg43 GPR64common:%vreg46
 BR %vreg47<kill>; GPR64:%vreg47

After PHI elimination, due to the critical edge, BB#7 ends up with many copy instructions:

BB#2:

 %vreg11<def> = COPY %vreg84<kill>; GPR64:%vreg11,%vreg84
%vreg10<def> = COPY %vreg83<kill>; GPR64:%vreg10,%vreg83
%vreg9<def> = COPY %vreg82<kill>; GPR64:%vreg9,%vreg82
%vreg8<def> = COPY %vreg81<kill>; GPR32all:%vreg8,%vreg81

 ....

BB#7:

%vreg81<def> = COPY %vreg33<kill>; GPR32all:%vreg81,%vreg33
   %vreg82<def> = COPY %vreg34<kill>; GPR64:%vreg82 GPR64all:%vreg34
   %vreg83<def> = COPY %vreg35<kill>; GPR64:%vreg83 GPR64all:%vreg35
   %vreg84<def> = COPY %vreg36<kill>; GPR64:%vreg84 GPR64all:%vreg36
   %vreg85<def> = COPY %vreg19; GPR32all:%vreg85 GPR32sp:%vreg19
   %vreg86<def> = COPY %vreg5; GPR64all:%vreg86,%vreg5
   %vreg87<def> = COPY %vreg2; GPR64all:%vreg87,%vreg2
   %vreg88<def> = COPY %vreg1; GPR64all:%vreg88,%vreg1

Some of these copy instructions, which cannot be sunk into a less frequent path (due to the failure to split the critical edge), later become the address computation ADRP/MOV during virtual register rewrite.

Since there are multiple sinkable instructions in the source BB in this case, MachineSink without Dehao's patch (which handles the single-instruction case) should work if the edge is actually splittable.
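As a rough illustration of why removing EB matters, the sketch below checks the usual definition of a critical edge (source block with more than one successor, destination block with more than one predecessor) on a toy CFG. The `ToyCFG` type and both helper functions are hypothetical and are not LLVM APIs; they only model the two diagrams above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy CFG (hypothetical model): Succs[i] lists the successor ids of block i.
using ToyCFG = std::vector<std::vector<int>>;

// Number of edges that enter block To.
static int countIncomingEdges(const ToyCFG &Succs, int To) {
  int Count = 0;
  for (const std::vector<int> &Out : Succs)
    for (int Target : Out)
      if (Target == To)
        ++Count;
  return Count;
}

// An edge From->To is critical when From has more than one successor and To
// has more than one predecessor.
bool isCriticalEdge(const ToyCFG &Succs, int From, int To) {
  assert(From >= 0 && static_cast<std::size_t>(From) < Succs.size());
  bool EdgeExists = false;
  for (int Target : Succs[From])
    EdgeExists |= (Target == To);
  return EdgeExists && Succs[From].size() > 1 &&
         countIncomingEdges(Succs, To) > 1;
}
```

With block ids 0 = BB#7, 1 = BB4, 2 = BB5, 3 = EB and 4 = BB2, the graph before the merge is `{{1, 2, 3}, {4}, {4}, {4}, {}}` and neither BB#7->EB nor EB->BB2 is critical. After the merge the graph is `{{1, 2, 4}, {4}, {4}, {}, {}}` and BB#7->BB2 is critical, which is the edge MachineSink then cannot split.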

@danielcdh: Could you point me to the original test case?

In my first posting in Diff 1, I added a test case (aarch64-skip-merging-case-block.ll) which was reduced from the benchmark I was targeting.

@davidxl: Given the limitation of MachineSink pass to split critical edges later (for the switch case), the effect of creating critical edge in CGP here can be quite detrimental, so we should probably make the patch more general -- instead of checking just switch inst in Predecessor, check indirect branch as well. Perhaps also adding some cost related heuristic -- by checking the number of phis (with incoming value from Pred) in the Succ block.

Thanks David for the detail. What you describe here is exactly what I observed. Based on your comment, I will update this change to avoid creating critical edges in case when potential sinkables cannot be handled in later pass (MachineSink); it's going to be more general to cover indirect branches as well, and a simple cost heuristic will be added based on the number of phis in Succ.
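A minimal sketch of the candidate check described in this exchange: an empty block is considered only when its unique predecessor ends in a switch or an indirect branch. The `Term` enum, the `ToyBlock` struct and `isSkipCandidate` are hypothetical stand-ins for illustration, not the code in the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical terminator kinds for a toy block model.
enum class Term { Branch, Switch, IndirectBr, Return };

struct ToyBlock {
  Term Terminator;        // kind of the block's terminator
  std::vector<int> Preds; // ids of predecessor blocks
  bool IsEmpty;           // true if the block holds only an unconditional branch
};

// True when block BB is empty and its unique predecessor ends in a switch or
// an indirect branch, i.e. when merging BB could create an edge that a later
// pass cannot split.
bool isSkipCandidate(const std::vector<ToyBlock> &Blocks, int BB) {
  assert(BB >= 0 && static_cast<std::size_t>(BB) < Blocks.size());
  const ToyBlock &B = Blocks[BB];
  if (!B.IsEmpty || B.Preds.size() != 1)
    return false;
  Term PredTerm = Blocks[B.Preds[0]].Terminator;
  return PredTerm == Term::Switch || PredTerm == Term::IndirectBr;
}
```

A cost heuristic such as the number of PHIs in the successor would then be applied only to blocks that pass this filter.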

Updated the change based on David Li's comments.

davidxl added inline comments.Sep 30 2016, 5:19 PM

lib/CodeGen/CodeGenPrepare.cpp
562	in the header of switch --> in the predecessor of BB instead of BB (if it is not merged)
576	Since a unique predecessor is checked here, PredBB's frequency is always no less than BB's. Given this, why not skip the frequency check (basically using a 1:1 ratio)?

davidxl added inline comments.Sep 30 2016, 7:41 PM

lib/CodeGen/CodeGenPrepare.cpp
576	On second thought, considering the cost of a direct branch, it is probably better to set the default frequency ratio to >= 2. Also, the check on the number of PHIs should probably be an 'OR' instead of an 'AND'. The default value of MinNumPhiInDestToSkipMerge should be 2: if (Freq(Pred) >= FreqRatio*Freq(BB) || NumCopyInsertionPHIs > MinNumPhis) return true; return false;
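The rule suggested in this comment can be written as a small standalone function. This is only a sketch of the condition as stated; the function name `shouldSkipMerge`, the parameter names and the use of plain integers for block frequencies are assumptions, and the defaults of 2 are the values proposed in the comment.

```cpp
#include <cassert>
#include <cstdint>

// Returns true when merging the empty block should be skipped under the
// suggested rule: the predecessor is at least FreqRatio times hotter than
// the empty block, OR there are more than MinNumPhis copy-inserting PHIs.
bool shouldSkipMerge(std::uint64_t PredFreq, std::uint64_t BBFreq,
                     unsigned NumCopyInsertionPHIs,
                     std::uint64_t FreqRatio = 2, unsigned MinNumPhis = 2) {
  assert(FreqRatio > 0 && "a zero ratio would make the check always true");
  return PredFreq >= FreqRatio * BBFreq || NumCopyInsertionPHIs > MinNumPhis;
}
```

For example, a predecessor frequency of 100 against a block frequency of 10 satisfies the frequency arm, while equal frequencies skip the merge only when there are at least three such PHIs.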

rengolin resigned from this revision.Oct 4 2016, 5:29 AM

rengolin removed a reviewer: rengolin.

On second thought, considering the cost of a direct branch, it is probably better to set the default frequency ratio to >= 2. Also, the check on the number of PHIs should probably be an 'OR' instead of an 'AND'. The default value of MinNumPhiInDestToSkipMerge should be 2:
if (Freq(Pred) >= FreqRatio*Freq(BB) || NumCopyInsertionPHIs > MinNumPhis)
  return true;

Thanks David for the review. I ran spec2000/2006 with the change you suggested. Most of the score changes seem to be noise, but I can see a reproducible -2% regression in spec2006/povray when we are aggressive here (i.e., with the default frequency ratio >= 2). I can see more I-cache misses with "FreqRatio=2" in spec2006/povray. It seems that it skips merging empty cases even for a switch with a small number of cases, and the cost of the branch is non-trivial, especially when it disturbs the I-cache.

I think we need to be more conservative for the frequency ratio so that we skip merging only when we are sure that the empty case is less frequently executed.

As of now, with/without this change (with FreqRatio=2), I can see 6.7% more L1 I-cache miss from perf-stat. I didn’t try to narrow down to the place in which the regression was caused, but I printed out FreqRatio, NumCopyInsertionPHIs, and NumOfCases when this change applied and I can see this was applied even when the number of cases in switch is small enough due to the aggressive FreqRatio=2. Please let me know if you want me to further narrow down.

FreqRatio: 160 , NumCopyInsertionPHIs : 1, NumCase: 5
FreqRatio: 102 , NumCopyInsertionPHIs : 1, NumCase: 3
FreqRatio: 786 , NumCopyInsertionPHIs : 16, NumCase: 793
FreqRatio: 64 , NumCopyInsertionPHIs : 1, NumCase: 2
FreqRatio: 128 , NumCopyInsertionPHIs : 1, NumCase: 4
FreqRatio: 128 , NumCopyInsertionPHIs : 0, NumCase: 4
FreqRatio: 128 , NumCopyInsertionPHIs : 0, NumCase: 4
FreqRatio: 128 , NumCopyInsertionPHIs : 0, NumCase: 4
FreqRatio: 128 , NumCopyInsertionPHIs : 0, NumCase: 4
FreqRatio: 128 , NumCopyInsertionPHIs : 0, NumCase: 4
FreqRatio: 128 , NumCopyInsertionPHIs : 0, NumCase: 4
FreqRatio: 128 , NumCopyInsertionPHIs : 0, NumCase: 4
FreqRatio: 128 , NumCopyInsertionPHIs : 0, NumCase: 4
FreqRatio: 8 , NumCopyInsertionPHIs : 5, NumCase: 7
FreqRatio: 8 , NumCopyInsertionPHIs : 5, NumCase: 7
FreqRatio: 8 , NumCopyInsertionPHIs : 5, NumCase: 7
FreqRatio: 8 , NumCopyInsertionPHIs : 5, NumCase: 7
FreqRatio: 8 , NumCopyInsertionPHIs : 5, NumCase: 7
FreqRatio: 8 , NumCopyInsertionPHIs : 5, NumCase: 7
FreqRatio: 8 , NumCopyInsertionPHIs : 5, NumCase: 7
FreqRatio: 8 , NumCopyInsertionPHIs : 5, NumCase: 7
FreqRatio: 8 , NumCopyInsertionPHIs : 5, NumCase: 7
FreqRatio: 8 , NumCopyInsertionPHIs : 5, NumCase: 7
FreqRatio: 32 , NumCopyInsertionPHIs : 1, NumCase: 7
FreqRatio: 32 , NumCopyInsertionPHIs : 1, NumCase: 8
FreqRatio: 32 , NumCopyInsertionPHIs : 3, NumCase: 4
FreqRatio: 96 , NumCopyInsertionPHIs : 0, NumCase: 3
FreqRatio: 64 , NumCopyInsertionPHIs : 0, NumCase: 2
FreqRatio: 39 , NumCopyInsertionPHIs : 0, NumCase: 32
FreqRatio: 1024 , NumCopyInsertionPHIs : 0, NumCase: 32
FreqRatio: 1024 , NumCopyInsertionPHIs : 0, NumCase: 32
FreqRatio: 39 , NumCopyInsertionPHIs : 1, NumCase: 32
FreqRatio: 864 , NumCopyInsertionPHIs : 0, NumCase: 27
FreqRatio: 32 , NumCopyInsertionPHIs : 0, NumCase: 2
FreqRatio: 64 , NumCopyInsertionPHIs : 1, NumCase: 3
FreqRatio: 64 , NumCopyInsertionPHIs : 1, NumCase: 3
FreqRatio: 64 , NumCopyInsertionPHIs : 1, NumCase: 3
FreqRatio: 64 , NumCopyInsertionPHIs : 1, NumCase: 3
FreqRatio: 193 , NumCopyInsertionPHIs : 1, NumCase: 6

can you isolate one small case and show the final code with/without the change?

Let me take a small switch applied by this change from povray .

I can see a couple of issues:

the cost of direct branch can be modelled better

static profile can get it quite wrong if the branch prediction heuristic is wrong. Do you see similar regression with PGO? I can see without PGO, we should indeed make the frequency ratio to be much larger (i.e., require more cases in switch).

I agree that we may need to improve the static heuristic especially for switch.
I haven't tried PGO for this change. I can run povray with PGO.
Yes, I believe we should be conservative enough in the frequency ratio in CGP so that the extra branch should be added only when we are sure that the empty case is certainly less frequently executed.

With PGO I can see just a noise-level regression (-0.5%) with this patch with FreqRatio=2.
In Evaluate_TPat(), I can see this patch was applied to 10 empty case blocks, resulting in 10 more branches in the final assembly on aarch64 with -mcpu=kryo. Since SimplifyCFG also merges empty blocks in TryToSimplifyUncondBranchFromEmptyBlock(), making a simple case hit in CGP requires quite a complex test case. That's why I removed my first test case from Diff 1 (aarch64-skip-merging-case-block.ll). Please let me know if you really need a reduced test case.
As of now, I believe we need to keep the default FreqRatio conservative in this patch, so I want to set it to 1000, which I believe is conservative enough.

Does it also enable more machine code sinking? If not, perhaps the heuristics to count copy insertion phis can be improved?

The initial purpose of this patch was to handle the cases which cannot be handled in MachineSink by letting ISel insert COPYs in a better place. By avoiding the creation of a critical edge that is non-splittable in MachineSink, I think it could also indirectly increase the chance of machine sinking. Since a larger number of copy insertion PHIs means more chances to sink in a later pass, integrating the number of PHIs into the heuristic seems reasonable to me. However, I think the minimum frequency check should come first, because we don't want to leave an extra branch in a likely executed block regardless of the number of PHIs. To integrate NumCopyInsertionPHIs and the minimum frequency requirement, I thought about a heuristic something like:

CurFreqRatio = Freq(PredBB) / Freq(BB);
if (CurFreqRatio > MinThresholdToSkipMerge &&
    (NumCopyInsertionPHIs * CurFreqRatio) >
        MinNumPhiInDestToSkipMerge * MinThresholdToSkipMerge) {
  // skip merging current empty case
}

Please let me know your thought.
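The combined heuristic above can be made concrete as follows. This is a sketch under stated assumptions: block frequencies are modelled as plain unsigned integers, the ratio uses integer division, and the function name `shouldSkipMergeCombined` and the parameter names are hypothetical. The threshold values are passed in rather than fixed, since the thread had not settled on defaults at this point.

```cpp
#include <cassert>
#include <cstdint>

// Skip merging only if the predecessor is hotter than the empty block by
// more than MinThreshold, and the PHI count scaled by that ratio exceeds
// MinNumPhi * MinThreshold. The frequency check comes first, so a large PHI
// count alone never keeps a branch in a likely executed block.
bool shouldSkipMergeCombined(std::uint64_t PredFreq, std::uint64_t BBFreq,
                             std::uint64_t NumCopyInsertionPHIs,
                             std::uint64_t MinThreshold,
                             std::uint64_t MinNumPhi) {
  assert(BBFreq > 0 && "block frequency must be non-zero");
  std::uint64_t CurFreqRatio = PredFreq / BBFreq; // integer division
  return CurFreqRatio > MinThreshold &&
         NumCopyInsertionPHIs * CurFreqRatio > MinNumPhi * MinThreshold;
}
```

With a threshold of 1000 and a minimum PHI count of 2, a ratio of 3000 with one PHI passes (3000 > 2000), a ratio of 1500 with one PHI does not (1500 is not greater than 2000), and a ratio of 800 never passes regardless of the PHI count.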

How do you come up with 1000? It seems like specially tuned for that benchmark where the switch top is in a loop while the switch body is not -- the patch basically will never kick in for any switch in normal form without profile.

In case of the benchmark I was targeting, it has a huge switch in a loop and the switch have many empty cases only used as incoming blocks in PHIs. So, merging those empty cases cause many COPYs in the header of switch and MachineSink cannot sink them because it cannot split the critical edges across the jump table. Since the switch is huge, the ratio of FreqOfSwitchHeader to FreqOfEmptyCase was pretty large, more than 3000. In this patch I wanted to hit only such clear cases where the frequency of case is certainly lower than the switch header because the cost of extra branch added by skipping merging empty cases is sometime non-trivial especially when the case is likely to be taken as we observed in the povray case.

Integrated David's comments. Thanks David for your review with the detail comments.

Basically, I took the heuristic you suggested Freq(Pred) / Freq(BB) > 2. Please find the detail from the comment in the code.

Using the latest trunk, I ran the performance tests for spec2000/2006 and didn't see a regression even in spec2006.povray. The function this patch applies to is not even hot in my profile, so the previous regression I observed must have been caused by different code alignment.

Regarding inspecting PHIs, for me, it seems not that easy to predict if a PHI ends up with a COPY in CodeGenPrepare, as COPYs might be added when performing deSSA and removed in later passes (e.g., Register Coalescing or Machine Copy Propagation). In this cost heuristic in CGP, I believe it's not unreasonable to see a PHI as one potential COPY. In the worst case where none of PHIs results in a COPY, the empty BB which is skipped here might end up with only one branch instruction, so it will be removed in BranchFold pass.

Regarding the multiple empty block case: when there are multiple empty blocks which are used as incoming blocks in the same PHI, we may be able to merge only one of them in most cases. That is because of the conflict in incoming values in the PHI if one of them is already merged. I modified the test case to introduce two empty blocks. In the first test, f_switch(), both empty blocks are skipped as both of them are unlikely to be executed. In the second test, f_switch2(), once the first empty block (sw.bb) is merged, the second block (sw.bb2) cannot be merged because of the conflict with the incoming value from sw.bb, which is already merged. If all the incoming values from the multiple empty blocks in the same PHI are the same, then they should be either all merged or all skipped, but I doubt we see such a case in CGP in general. Please let me know if you were referring to different cases in your previous comment.

Thanks,
Jun

Integrated David's comments by adding support to handle multiple empty blocks sharing the same incoming value for the PHIs in the DestBB. Added f_switch3() in the testcase, which will describe the case of multiple empty blocks. Please take a look and let me know any comments.

Thanks,
Jun

davidxl added inline comments.Nov 7 2016, 10:52 AM

lib/CodeGen/CodeGenPrepare.cpp
401	Nit: If --> Of
545	Use early return to reduce nesting level.
566	Find all other incoming blocks from which incoming values of all PHIs in DestBB are the same as the ones from BB.
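The search described in the last comment (other incoming blocks whose incoming values match BB's for every PHI in DestBB) can be sketched on a toy model. Here a PHI is just a map from incoming block id to value id; the `ToyPhi` alias and `sameIncomingValueBlocks` are hypothetical names, not the data structures used in CodeGenPrepare.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <vector>

// Toy model: a PHI maps each incoming block id to the id of its value.
using ToyPhi = std::map<int, int>;

// Returns the incoming blocks other than BB that supply, for every PHI in
// the destination block, the same value that BB supplies.
std::set<int> sameIncomingValueBlocks(const std::vector<ToyPhi> &Phis,
                                      int BB) {
  std::set<int> Result;
  if (Phis.empty())
    return Result;
  // Start with every other incoming block of the first PHI as a candidate.
  for (const auto &Entry : Phis.front())
    if (Entry.first != BB)
      Result.insert(Entry.first);
  // Drop a candidate as soon as one PHI disagrees with BB's value.
  for (const ToyPhi &P : Phis) {
    auto BBIt = P.find(BB);
    assert(BBIt != P.end() && "BB must be an incoming block of every PHI");
    for (auto It = Result.begin(); It != Result.end();) {
      auto PredIt = P.find(*It);
      if (PredIt == P.end() || PredIt->second != BBIt->second)
        It = Result.erase(It);
      else
        ++It;
    }
  }
  return Result;
}
```

Blocks in the returned set can be treated together with BB, so that they are either all merged or all skipped.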

Addressed comments from David Li and added fixes in test case failures in :

ragreedy-hoist-spill.ll
AArch64/widen_switch.ll
X86/widen_switch.ll  
phi-immediate-factoring.ll

Herald added a subscriber: qcolombet. Nov 8 2016, 11:33 AM

junbuml added a reviewer: manmanren.Nov 8 2016, 11:34 AM

Hi Manman / Wei,

This change modified ragreedy-hoist-spill.ll, which was previously modified in 8dbd561a and 815b02e9. Can you please check whether the change in ragreedy-hoist-spill.ll still makes sense?
Thanks,
Jun

manmanren added inline comments.Nov 9 2016, 11:41 AM

test/CodeGen/X86/ragreedy-hoist-spill.ll
183	The whole purpose of this testing case is to make sure that a spill is not hoisted to a hotter outer loop. As long as that is still true with your change, it is fine.

junbuml marked an inline comment as done.Nov 9 2016, 12:31 PM

junbuml added inline comments.

test/CodeGen/X86/ragreedy-hoist-spill.ll
183	Thanks Manman. Yes, this change should not change such behavior. In this test, the spill is hoisted in sw.bb474, which is not the hotter outer loop; still even colder than the inner loop.

Kindly ping?

Do you know how often BPI/BFI is actually needed? In other words, among all functions compiled, how many of them reach to the point that BFI is needed? If you have some stats that will be great. If the data shows not often -- then BPI/BFI may need to be computed in a lazy way.

Do you know how often BPI/BFI is actually needed? In other words, among all functions compiled, how many of them reach to the point that BFI is needed? If you have some stats that will be great. If the data shows not often -- then BPI/BFI may need to be computed in a lazy way.

I'm quite sure that BPI/BFI would rarely be used, as we specifically look for empty blocks which have a unique predecessor terminated by SwitchInst or IndirectBrInst. We can compute BPI/BFI in isMergingEmptyBlockProfitable() and then cache it only for the current function. Please let me know if you see any downside to this.

Addressed David's comment. Now we compute BFI/BPI only when needed.
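The compute-on-first-use idea can be illustrated with a small generic cache. This is a sketch of the general technique only; `LazyAnalysis` is a hypothetical helper and does not reflect how the patch stores BFI/BPI inside CodeGenPrepare.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <utility>

// Hypothetical helper: computes an analysis result on first use and caches
// it until reset() is called (for example, when moving to another function).
template <typename T> class LazyAnalysis {
  std::function<T()> Compute;
  std::unique_ptr<T> Cached;

public:
  explicit LazyAnalysis(std::function<T()> Fn) : Compute(std::move(Fn)) {
    assert(Compute != nullptr && "a compute callback is required");
  }

  // Runs the callback only if no cached result exists yet.
  T &get() {
    if (!Cached)
      Cached = std::make_unique<T>(Compute());
    return *Cached;
  }

  bool isComputed() const { return Cached != nullptr; }

  // Drops the cached result so the next get() recomputes it.
  void reset() { Cached.reset(); }
};
```

Functions that never reach the profitability check never call `get()`, so they pay nothing for the analysis.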

lgtm

This revision is now accepted and ready to land.Nov 19 2016, 12:14 PM

Closed by commit rL287553: [CodeGenPrep] Skip merging empty case blocks (authored by junbuml). Nov 21 2016, 8:57 AM

This revision was automatically updated to reflect the committed changes.

Reverted with r288052.

This change was reverted in r288052 due to invalid loop info after eliminating an empty block. Now I create LoopInfo when creating BFI/BPI so that it is not affected by previously eliminated empty blocks.

junbuml reopened this revision.Nov 30 2016, 8:38 AM

This revision is now accepted and ready to land.Nov 30 2016, 8:38 AM

junbuml requested a review of this revision.Nov 30 2016, 8:39 AM

junbuml edited edge metadata.

junbuml added a reviewer: joerg.Nov 30 2016, 8:43 AM

Can you include the failing test case just to make sure no future changes will trigger this again?

Addressed Joerg's comment.

Kindly ping?

Kindly ping one more time.

This change was reverted due to invalid loop info after eliminating an empty block. The only change I made is that I create a LoopInfo when creating BFI/BPI so that it is not affected by previously eliminated empty blocks.

lgtm

This revision is now accepted and ready to land.Dec 15 2016, 10:01 AM

Closed by commit rL289951: [CodeGenPrep] Skip merging empty case blocks (authored by junbuml). Dec 16 2016, 8:13 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path                                                          Size
lib/CodeGen/CodeGenPrepare.cpp                                170 lines
test/CodeGen/X86/phi-immediate-factoring.ll                   3 lines
test/CodeGen/X86/ragreedy-hoist-spill.ll                      8 lines
test/Transforms/CodeGenPrepare/AArch64/widen_switch.ll        6 lines
test/Transforms/CodeGenPrepare/X86/widen_switch.ll            6 lines
test/Transforms/CodeGenPrepare/skip-merging-case-block.ll     144 lines

Diff 77220

lib/CodeGen/CodeGenPrepare.cpp

Show All 11 Lines
// basic-block-at-a-time approach. It should eventually be removed.		// basic-block-at-a-time approach. It should eventually be removed.
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

#include "llvm/CodeGen/Passes.h"		#include "llvm/CodeGen/Passes.h"
#include "llvm/ADT/DenseMap.h"		#include "llvm/ADT/DenseMap.h"
#include "llvm/ADT/SmallSet.h"		#include "llvm/ADT/SmallSet.h"
#include "llvm/ADT/Statistic.h"		#include "llvm/ADT/Statistic.h"
		#include "llvm/Analysis/BlockFrequencyInfo.h"
		#include "llvm/Analysis/BranchProbabilityInfo.h"
#include "llvm/Analysis/InstructionSimplify.h"		#include "llvm/Analysis/InstructionSimplify.h"
#include "llvm/Analysis/LoopInfo.h"		#include "llvm/Analysis/LoopInfo.h"
#include "llvm/Analysis/ProfileSummaryInfo.h"		#include "llvm/Analysis/ProfileSummaryInfo.h"
#include "llvm/Analysis/TargetLibraryInfo.h"		#include "llvm/Analysis/TargetLibraryInfo.h"
#include "llvm/Analysis/TargetTransformInfo.h"		#include "llvm/Analysis/TargetTransformInfo.h"
#include "llvm/Analysis/ValueTracking.h"		#include "llvm/Analysis/ValueTracking.h"
#include "llvm/Analysis/MemoryBuiltins.h"		#include "llvm/Analysis/MemoryBuiltins.h"
#include "llvm/CodeGen/Analysis.h"		#include "llvm/CodeGen/Analysis.h"
▲ Show 20 Lines • Show All 91 Lines • ▼ Show 20 Lines
static cl::opt<bool> DisablePreheaderProtect(		static cl::opt<bool> DisablePreheaderProtect(
"disable-preheader-prot", cl::Hidden, cl::init(false),		"disable-preheader-prot", cl::Hidden, cl::init(false),
cl::desc("Disable protection against removing loop preheaders"));		cl::desc("Disable protection against removing loop preheaders"));

static cl::opt<bool> ProfileGuidedSectionPrefix(		static cl::opt<bool> ProfileGuidedSectionPrefix(
"profile-guided-section-prefix", cl::Hidden, cl::init(true),		"profile-guided-section-prefix", cl::Hidden, cl::init(true),
cl::desc("Use profile info to add section prefix for hot/cold functions"));		cl::desc("Use profile info to add section prefix for hot/cold functions"));

		static cl::opt<unsigned> FreqRatioToSkipMerge(
		"cgp-freq-ratio-to-skip-merge", cl::Hidden, cl::init(2),
		cl::desc("Skip merging empty blocks if (frequency of empty block) / "
		"(frequency of destination block) is greater than this ratio"));

namespace {		namespace {
typedef SmallPtrSet<Instruction *, 16> SetOfInstrs;		typedef SmallPtrSet<Instruction *, 16> SetOfInstrs;
typedef PointerIntPair<Type *, 1, bool> TypeIsSExt;		typedef PointerIntPair<Type *, 1, bool> TypeIsSExt;
typedef DenseMap<Instruction *, TypeIsSExt> InstrToOrigTy;		typedef DenseMap<Instruction *, TypeIsSExt> InstrToOrigTy;
class TypePromotionTransaction;		class TypePromotionTransaction;

class CodeGenPrepare : public FunctionPass {		class CodeGenPrepare : public FunctionPass {
const TargetMachine *TM;		const TargetMachine *TM;
▲ Show 20 Lines • Show All 41 Lines • ▼ Show 20 Lines	void getAnalysisUsage(AnalysisUsage &AU) const override {
AU.addRequired<ProfileSummaryInfoWrapperPass>();		AU.addRequired<ProfileSummaryInfoWrapperPass>();
AU.addRequired<TargetLibraryInfoWrapperPass>();		AU.addRequired<TargetLibraryInfoWrapperPass>();
AU.addRequired<TargetTransformInfoWrapperPass>();		AU.addRequired<TargetTransformInfoWrapperPass>();
AU.addRequired<LoopInfoWrapperPass>();		AU.addRequired<LoopInfoWrapperPass>();
}		}

private:		private:
bool eliminateFallThrough(Function &F);		bool eliminateFallThrough(Function &F);
bool eliminateMostlyEmptyBlocks(Function &F);		bool eliminateMostlyEmptyBlocks(Function &F, BlockFrequencyInfo &BFI);
		BasicBlock findDestBlockOfMergeableEmptyBlock(BasicBlock BB);
bool canMergeBlocks(const BasicBlock BB, const BasicBlock DestBB) const;		bool canMergeBlocks(const BasicBlock BB, const BasicBlock DestBB) const;
void eliminateMostlyEmptyBlock(BasicBlock *BB);		void eliminateMostlyEmptyBlock(BasicBlock *BB);
		bool isMergingEmptyBlockProfitable(BasicBlock BB, BasicBlock DestBB,
		bool isPreheader,
		BlockFrequencyInfo &BFI);
bool optimizeBlock(BasicBlock &BB, bool& ModifiedDT);		bool optimizeBlock(BasicBlock &BB, bool& ModifiedDT);
bool optimizeInst(Instruction *I, bool& ModifiedDT);		bool optimizeInst(Instruction *I, bool& ModifiedDT);
bool optimizeMemoryInst(Instruction I, Value Addr,		bool optimizeMemoryInst(Instruction I, Value Addr,
Type *AccessTy, unsigned AS);		Type *AccessTy, unsigned AS);
bool optimizeInlineAsmInst(CallInst *CS);		bool optimizeInlineAsmInst(CallInst *CS);
bool optimizeCallInst(CallInst *CI, bool& ModifiedDT);		bool optimizeCallInst(CallInst *CI, bool& ModifiedDT);
bool moveExtToFormExtLoad(Instruction *&I);		bool moveExtToFormExtLoad(Instruction *&I);
bool optimizeExtUses(Instruction *I);		bool optimizeExtUses(Instruction *I);
▲ Show 20 Lines • Show All 49 Lines • ▼ Show 20 Lines	if (ProfileGuidedSectionPrefix) {
ProfileSummaryInfo *PSI =		ProfileSummaryInfo *PSI =
getAnalysis<ProfileSummaryInfoWrapperPass>().getPSI();		getAnalysis<ProfileSummaryInfoWrapperPass>().getPSI();
if (PSI->isFunctionEntryHot(&F))		if (PSI->isFunctionEntryHot(&F))
F.setSectionPrefix(".hot");		F.setSectionPrefix(".hot");
else if (PSI->isFunctionEntryCold(&F))		else if (PSI->isFunctionEntryCold(&F))
F.setSectionPrefix(".cold");		F.setSectionPrefix(".cold");
}		}

		BranchProbabilityInfo BPI(F, *LI);
		BlockFrequencyInfo BFI(F, BPI, *LI);

/// This optimization identifies DIV instructions that can be		/// This optimization identifies DIV instructions that can be
/// profitably bypassed and carried out with a shorter, faster divide.		/// profitably bypassed and carried out with a shorter, faster divide.
if (!OptSize && TLI && TLI->isSlowDivBypassed()) {		if (!OptSize && TLI && TLI->isSlowDivBypassed()) {
const DenseMap<unsigned int, unsigned int> &BypassWidths =		const DenseMap<unsigned int, unsigned int> &BypassWidths =
TLI->getBypassSlowDivWidths();		TLI->getBypassSlowDivWidths();
BasicBlock* BB = &*F.begin();		BasicBlock* BB = &*F.begin();
while (BB != nullptr) {		while (BB != nullptr) {
// bypassSlowDivision may create new BBs, but we don't want to reapply the		// bypassSlowDivision may create new BBs, but we don't want to reapply the
// optimization to those blocks.		// optimization to those blocks.
BasicBlock* Next = BB->getNextNode();		BasicBlock* Next = BB->getNextNode();
EverMadeChange |= bypassSlowDivision(BB, BypassWidths);		EverMadeChange |= bypassSlowDivision(BB, BypassWidths);
BB = Next;		BB = Next;
}		}
}		}

// Eliminate blocks that contain only PHI nodes and an		// Eliminate blocks that contain only PHI nodes and an
// unconditional branch.		// unconditional branch.
EverMadeChange |= eliminateMostlyEmptyBlocks(F);		EverMadeChange |= eliminateMostlyEmptyBlocks(F, BFI);

// llvm.dbg.value is far away from the value then iSel may not be able		// llvm.dbg.value is far away from the value then iSel may not be able
// handle it properly. iSel will drop llvm.dbg.value if it can not		// handle it properly. iSel will drop llvm.dbg.value if it can not
// find a node corresponding to the value.		// find a node corresponding to the value.
EverMadeChange |= placeDbgValues(F);		EverMadeChange |= placeDbgValues(F);

// If there is a mask, compare against zero, and branch that can be combined		// If there is a mask, compare against zero, and branch that can be combined
// into a single target instruction, push the mask and compare into branch		// into a single target instruction, push the mask and compare into branch
...
if (Term && !Term->isConditional()) {		if (Term && !Term->isConditional()) {

// We have erased a block. Update the iterator.		// We have erased a block. Update the iterator.
I = BB->getIterator();		I = BB->getIterator();
}		}
}		}
return Changed;		return Changed;
}		}

/// Eliminate blocks that contain only PHI nodes, debug info directives, and an		/// Find a destination block from BB if BB is mergeable empty block.
/// unconditional branch. Passes before isel (e.g. LSR/loopsimplify) often split		BasicBlock *CodeGenPrepare::findDestBlockOfMergeableEmptyBlock(BasicBlock *BB) {
		davidxl: Nit: If --> Of
/// edges in ways that are non-optimal for isel. Start by eliminating these
/// blocks so we can split them the way we want them.
bool CodeGenPrepare::eliminateMostlyEmptyBlocks(Function &F) {
SmallPtrSet<BasicBlock *, 16> Preheaders;
SmallVector<Loop *, 16> LoopList(LI->begin(), LI->end());
while (!LoopList.empty()) {
Loop *L = LoopList.pop_back_val();
LoopList.insert(LoopList.end(), L->begin(), L->end());
if (BasicBlock *Preheader = L->getLoopPreheader())
Preheaders.insert(Preheader);
}

bool MadeChange = false;
// Note that this intentionally skips the entry block.
for (Function::iterator I = std::next(F.begin()), E = F.end(); I != E;) {
BasicBlock *BB = &*I++;

// If this block doesn't end with an uncond branch, ignore it.		// If this block doesn't end with an uncond branch, ignore it.
BranchInst *BI = dyn_cast<BranchInst>(BB->getTerminator());		BranchInst *BI = dyn_cast<BranchInst>(BB->getTerminator());
if (!BI || !BI->isUnconditional())		if (!BI || !BI->isUnconditional())
continue;		return nullptr;

// If the instruction before the branch (skipping debug info) isn't a phi		// If the instruction before the branch (skipping debug info) isn't a phi
// node, then other stuff is happening here.		// node, then other stuff is happening here.
BasicBlock::iterator BBI = BI->getIterator();		BasicBlock::iterator BBI = BI->getIterator();
if (BBI != BB->begin()) {		if (BBI != BB->begin()) {
--BBI;		--BBI;
while (isa<DbgInfoIntrinsic>(BBI)) {		while (isa<DbgInfoIntrinsic>(BBI)) {
if (BBI == BB->begin())		if (BBI == BB->begin())
break;		break;
--BBI;		--BBI;
}		}
if (!isa<DbgInfoIntrinsic>(BBI) && !isa<PHINode>(BBI))		if (!isa<DbgInfoIntrinsic>(BBI) && !isa<PHINode>(BBI))
continue;		return nullptr;
}		}

// Do not break infinite loops.		// Do not break infinite loops.
BasicBlock *DestBB = BI->getSuccessor(0);		BasicBlock *DestBB = BI->getSuccessor(0);
if (DestBB == BB)		if (DestBB == BB)
continue;		return nullptr;

if (!canMergeBlocks(BB, DestBB))		if (!canMergeBlocks(BB, DestBB))
continue;		DestBB = nullptr;

// Do not delete loop preheaders if doing so would create a critical edge.		return DestBB;
// Loop preheaders can be good locations to spill registers. If the		}
// preheader is deleted and we create a critical edge, registers may be
// spilled in the loop body instead.		/// Eliminate blocks that contain only PHI nodes, debug info directives, and an
if (!DisablePreheaderProtect && Preheaders.count(BB) &&		/// unconditional branch. Passes before isel (e.g. LSR/loopsimplify) often split
!(BB->getSinglePredecessor() && BB->getSinglePredecessor()->getSingleSuccessor()))		/// edges in ways that are non-optimal for isel. Start by eliminating these
		/// blocks so we can split them the way we want them.
		bool CodeGenPrepare::eliminateMostlyEmptyBlocks(Function &F,
		BlockFrequencyInfo &BFI) {
		SmallPtrSet<BasicBlock *, 16> Preheaders;
		SmallVector<Loop *, 16> LoopList(LI->begin(), LI->end());
		while (!LoopList.empty()) {
		Loop *L = LoopList.pop_back_val();
		LoopList.insert(LoopList.end(), L->begin(), L->end());
		if (BasicBlock *Preheader = L->getLoopPreheader())
		Preheaders.insert(Preheader);
		}

		bool MadeChange = false;
		// Note that this intentionally skips the entry block.
		for (Function::iterator I = std::next(F.begin()), E = F.end(); I != E;) {
		BasicBlock *BB = &*I++;
		BasicBlock *DestBB = findDestBlockOfMergeableEmptyBlock(BB);
		if (!DestBB ||
		!isMergingEmptyBlockProfitable(BB, DestBB, Preheaders.count(BB), BFI))
continue;		continue;

eliminateMostlyEmptyBlock(BB);		eliminateMostlyEmptyBlock(BB);
MadeChange = true;		MadeChange = true;
}		}
return MadeChange;		return MadeChange;
}		}

		bool CodeGenPrepare::isMergingEmptyBlockProfitable(BasicBlock *BB,
		BasicBlock *DestBB,
		bool isPreheader,
		BlockFrequencyInfo &BFI) {
		// Do not delete loop preheaders if doing so would create a critical edge.
		// Loop preheaders can be good locations to spill registers. If the
		// preheader is deleted and we create a critical edge, registers may be
		// spilled in the loop body instead.
		if (!DisablePreheaderProtect && isPreheader &&
		!(BB->getSinglePredecessor() &&
		BB->getSinglePredecessor()->getSingleSuccessor()))
		return false;

		// Try to skip merging if the unique predecessor of BB is terminated by a
		// switch or indirect branch instruction, and BB is used as an incoming block
		// of PHIs in DestBB. In such case, merging BB and DestBB would cause ISel to
		// add COPY instructions in the predecessor of BB instead of BB (if it is not
		// merged). Note that the critical edge created by merging such blocks won't be
		// split in MachineSink because the jump table is not analyzable. By keeping
		// such empty block (BB), ISel will place COPY instructions in BB, not in the
		// predecessor of BB.
		BasicBlock *Pred = BB->getUniquePredecessor();
		if (!Pred ||
		!(isa<SwitchInst>(Pred->getTerminator()) ||
		isa<IndirectBrInst>(Pred->getTerminator())))
		return true;

		if (BB->getTerminator() != BB->getFirstNonPHI())
		return true;

		// We use a simple cost heuristic which determines skipping merging is
		// profitable if the cost of skipping merging is less than the cost of
		// merging : Cost(skipping merging) < Cost(merging BB), where the
		// Cost(skipping merging) is Freq(BB) * (Cost(Copy) + Cost(Branch)), and
		// the Cost(merging BB) is Freq(Pred) * Cost(Copy).
		// Assuming Cost(Copy) == Cost(Branch), we could simplify it to :
		// Freq(Pred) / Freq(BB) > 2.
		// Note that if there are multiple empty blocks sharing the same incoming
		// value for the PHIs in the DestBB, we consider them together. In such
		// case, Cost(merging BB) will be the sum of their frequencies.

		if (!isa<PHINode>(DestBB->begin()))
		return true;

		BlockFrequency PredFreq = BFI.getBlockFreq(Pred);
		BlockFrequency BBFreq = BFI.getBlockFreq(BB);
		SmallPtrSet<BasicBlock *, 16> SameIncomingValueBBs;

		// Find all other incoming blocks from which incoming values of all PHIs in
		// DestBB are the same as the ones from BB.
		for (pred_iterator PI = pred_begin(DestBB), E = pred_end(DestBB); PI != E;
		++PI) {
		BasicBlock *DestBBPred = *PI;
		if (DestBBPred == BB)
		continue;

		bool HasAllSameValue = true;
		BasicBlock::const_iterator DestBBI = DestBB->begin();
		while (const PHINode *DestPN = dyn_cast<PHINode>(DestBBI++)) {
		if (DestPN->getIncomingValueForBlock(BB) !=
		DestPN->getIncomingValueForBlock(DestBBPred)) {
		HasAllSameValue = false;
		break;
		}
		}
		if (HasAllSameValue)
		SameIncomingValueBBs.insert(DestBBPred);
		}

		// See if all BB's incoming values are same as the value from Pred. In this
		// case, no reason to skip merging because COPYs are expected to be placed in
		// Pred already.
		if (SameIncomingValueBBs.count(Pred))
		return true;

		for (auto SameValueBB : SameIncomingValueBBs)
		davidxl: If the unique Predecessor of BB is terminated ..
		junbuml: Thanks !
		if (SameValueBB->getUniquePredecessor() == Pred &&
		DestBB == findDestBlockOfMergeableEmptyBlock(SameValueBB))
		BBFreq += BFI.getBlockFreq(SameValueBB);
		davidxl: Do you have actual examples showing the problem of extra copy instruction added? I could not connect the dots here.
		junbuml: In my first posting in Diff 1, I added a test case (aarch64-skip-merging-case-block.ll) which was reduced from the benchmark I was targeting. This test case should show copies in the header of a switch that is in a loop and reproduce the situation in CGP. I removed this test as it is unnecessarily complex to be used as a test case. Please see aarch64-skip-merging-case-block.ll in Diff 1 and let me know if you want me to add the test in this patch.

		return PredFreq.getFrequency() <=
		BBFreq.getFrequency() * FreqRatioToSkipMerge;
		}
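
The inequality quoted in the comment of isMergingEmptyBlockProfitable can be written out step by step. This is only a restatement of that comment; the symbols f (block frequency), c (the assumed common cost of a copy and a branch) and R (standing in for FreqRatioToSkipMerge) are shorthand introduced here, not names from the patch.

```latex
\begin{align*}
\mathrm{Cost}_{\text{skip}}  &= f_{BB}\,\bigl(c_{\text{copy}} + c_{\text{branch}}\bigr) = 2c\,f_{BB}
  && \text{assuming } c_{\text{copy}} = c_{\text{branch}} = c \\
\mathrm{Cost}_{\text{merge}} &= f_{Pred}\,c_{\text{copy}} = c\,f_{Pred} \\
\mathrm{Cost}_{\text{skip}} < \mathrm{Cost}_{\text{merge}}
  &\iff 2c\,f_{BB} < c\,f_{Pred}
  \iff \frac{f_{Pred}}{f_{BB}} > 2
\end{align*}
```

The function returns true (merging is profitable) in the complementary case, f_Pred <= R * f_BB, which agrees with the derivation when R is 2. When several empty blocks share the same incoming values, the code adds their frequencies into f_BB before the comparison.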

		davidxl: Use early return to reduce nesting level.
/// Return true if we can merge BB into DestBB if there is a single		/// Return true if we can merge BB into DestBB if there is a single
/// unconditional branch between them, and BB contains no other non-phi		/// unconditional branch between them, and BB contains no other non-phi
/// instructions.		/// instructions.
bool CodeGenPrepare::canMergeBlocks(const BasicBlock *BB,		bool CodeGenPrepare::canMergeBlocks(const BasicBlock *BB,
const BasicBlock *DestBB) const {		const BasicBlock *DestBB) const {
		rengolin: Why not just continue here?
		junbuml: My intention was to perform continue for the outer for loop, not for the inner while loop.
// We only want to eliminate blocks whose phi nodes are used by phi nodes in		// We only want to eliminate blocks whose phi nodes are used by phi nodes in
// the successor. If there are more complex condition (e.g. preheaders),		// the successor. If there are more complex condition (e.g. preheaders),
// don't mess around with them.		// don't mess around with them.
BasicBlock::const_iterator BBI = BB->begin();		BasicBlock::const_iterator BBI = BB->begin();
while (const PHINode *PN = dyn_cast<PHINode>(BBI++)) {		while (const PHINode *PN = dyn_cast<PHINode>(BBI++)) {
for (const User *U : PN->users()) {		for (const User *U : PN->users()) {
const Instruction *UI = cast<Instruction>(U);		const Instruction *UI = cast<Instruction>(U);
		davidxl: How about the size impact (for Os build) ?
		junbuml: This could potentially add a branch instruction, so we should do this when OptSize is false.
if (UI->getParent() != DestBB || !isa<PHINode>(UI))		if (UI->getParent() != DestBB || !isa<PHINode>(UI))
return false;		return false;
// If User is inside DestBB block and it is a PHINode then check		// If User is inside DestBB block and it is a PHINode then check
// incoming value. If incoming value is not from BB then this is		// incoming value. If incoming value is not from BB then this is
// a complex condition (e.g. preheaders) we want to avoid here.		// a complex condition (e.g. preheaders) we want to avoid here.
		davidxl: in the header of switch --> in the predecessor of BB instead of BB (if it is not merged)
if (UI->getParent() == DestBB) {		if (UI->getParent() == DestBB) {
if (const PHINode *UPN = dyn_cast<PHINode>(UI))		if (const PHINode *UPN = dyn_cast<PHINode>(UI))
for (unsigned I = 0, E = UPN->getNumIncomingValues(); I != E; ++I) {		for (unsigned I = 0, E = UPN->getNumIncomingValues(); I != E; ++I) {
Instruction *Insn = dyn_cast<Instruction>(UPN->getIncomingValue(I));		Instruction *Insn = dyn_cast<Instruction>(UPN->getIncomingValue(I));
		davidxl: Find all other incoming blocks from which incoming values of all PHIs in DestBB are the same as the ones from BB.
if (Insn && Insn->getParent() == BB &&		if (Insn && Insn->getParent() == BB &&
Insn->getParent() != UPN->getIncomingBlock(I))		Insn->getParent() != UPN->getIncomingBlock(I))
return false;		return false;
}		}
}		}
}		}
}		}

// If BB and DestBB contain any common predecessors, then the phi nodes in BB		// If BB and DestBB contain any common predecessors, then the phi nodes in BB
// and DestBB may have conflicting incoming values for the block. If so, we		// and DestBB may have conflicting incoming values for the block. If so, we
		davidxl: Since the unique predecessor is checked here, PredBB's frequency is always no less than BB's. Because of this, why not skip the frequency check (basically using a 1:1 ratio)?
		davidxl: On second thought, considering the cost of a direct branch, it is probably better to set the default frequency ratio to be >= 2. Also, the check on the number of PHIs should probably be an 'OR' instead of an 'AND'. The default value of MinNumPhiInDestToSkipMerge should be 2: if (Freq(Pred) >= FreqRatio * Freq(BB) || NumCopyInsertionPHIs > MinNumPhis) return true; return false;
// can't merge the block.		// can't merge the block.
const PHINode *DestBBPN = dyn_cast<PHINode>(DestBB->begin());		const PHINode *DestBBPN = dyn_cast<PHINode>(DestBB->begin());
if (!DestBBPN) return true; // no conflict.		if (!DestBBPN) return true; // no conflict.

// Collect the preds of BB.		// Collect the preds of BB.
SmallPtrSet<const BasicBlock*, 16> BBPreds;		SmallPtrSet<const BasicBlock*, 16> BBPreds;
if (const PHINode *BBPN = dyn_cast<PHINode>(BB->begin())) {		if (const PHINode *BBPN = dyn_cast<PHINode>(BB->begin())) {
// It is faster to get preds from a PHI than with pred_iterator.		// It is faster to get preds from a PHI than with pred_iterator.
...

test/CodeGen/X86/phi-immediate-factoring.ll

	; REQUIRES: asserts			; REQUIRES: asserts
	; RUN: llc < %s -disable-preheader-prot=true -march=x86 -stats 2>&1 | grep "Number of blocks eliminated" | grep 6			; RUN: llc < %s -disable-preheader-prot=true -march=x86 -stats 2>&1 | grep "Number of blocks eliminated" | grep 3
				; RUN: llc < %s -disable-preheader-prot=true -march=x86 -stats -cgp-freq-ratio-to-skip-merge=10 2>&1 | grep "Number of blocks eliminated" | grep 6
	; RUN: llc < %s -disable-preheader-prot=false -march=x86 -stats 2>&1 | grep "Number of blocks eliminated" | grep 3			; RUN: llc < %s -disable-preheader-prot=false -march=x86 -stats 2>&1 | grep "Number of blocks eliminated" | grep 3
	; PR1296			; PR1296

	target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:32:64-f32:32:32-f64:32:64-v64:64:64-v128:128:128-a0:0:64"			target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:32:64-f32:32:32-f64:32:64-v64:64:64-v128:128:128-a0:0:64"
	target triple = "i686-apple-darwin8"			target triple = "i686-apple-darwin8"

	define i32 @foo(i32 %A, i32 %B, i32 %C) nounwind {			define i32 @foo(i32 %A, i32 %B, i32 %C) nounwind {
	entry:			entry:
	...

test/CodeGen/X86/ragreedy-hoist-spill.ll

...
while.cond197.backedge:		while.cond197.backedge:
%oldc.1.be = phi i32 [ %oldc.13384, %sw.default ], [ %oldc.13384, %while.body200 ], [ %oldc.13384, %sw.bb1077 ], [ %oldc.13384, %sw.bb979 ], [ %oldc.13384, %sw.bb956 ], [ %oldc.13384, %sw.bb566 ], [ %oldc.13384, %for.end552 ], [ %oldc.13384, %sw.bb256 ], [ %oldc.13384, %sw.bb243 ], [ %oldc.13384, %while.cond201.preheader ], [ 0, %for.cond1145.preheader ], [ %oldc.13384, %sw.bb206 ]		%oldc.1.be = phi i32 [ %oldc.13384, %sw.default ], [ %oldc.13384, %while.body200 ], [ %oldc.13384, %sw.bb1077 ], [ %oldc.13384, %sw.bb979 ], [ %oldc.13384, %sw.bb956 ], [ %oldc.13384, %sw.bb566 ], [ %oldc.13384, %for.end552 ], [ %oldc.13384, %sw.bb256 ], [ %oldc.13384, %sw.bb243 ], [ %oldc.13384, %while.cond201.preheader ], [ 0, %for.cond1145.preheader ], [ %oldc.13384, %sw.bb206 ]
%cmp198 = icmp sgt i32 %dec3386, 0		%cmp198 = icmp sgt i32 %dec3386, 0
br i1 %cmp198, label %while.body200, label %while.end1465		br i1 %cmp198, label %while.body200, label %while.end1465

for.cond357:		for.cond357:
br label %for.cond357		br label %for.cond357

sw.bb474:		sw.bb474:
		; CHECK: sw.bb474
		; spill is hoisted here. Although loop depth1 is even hotter than loop depth2, sw.bb474 is still cold.
		; CHECK: movq %r{{.*}}, {{[0-9]+}}(%rsp)
		; CHECK: land.rhs485
		manmanren: The whole purpose of this testing case is to make sure that a spill is not hoisted to a hotter outer loop. As long as that is still true with your change, it is fine.
		junbuml: Thanks Manman. Yes, this change should not change such behavior. In this test, the spill is hoisted in sw.bb474, which is not the hotter outer loop; it is still even colder than the inner loop.
%cmp476 = icmp eq i8 undef, 0		%cmp476 = icmp eq i8 undef, 0
br i1 %cmp476, label %if.end517, label %do.body479.preheader		br i1 %cmp476, label %if.end517, label %do.body479.preheader

do.body479.preheader:		do.body479.preheader:
; CHECK: do.body479.preheader
; spill is hoisted here. Although loop depth1 is even hotter than loop depth2, do.body479.preheader is cold.
; CHECK: movq %r{{.*}}, {{[0-9]+}}(%rsp)
; CHECK: land.rhs485
%cmp4833314 = icmp eq i8 undef, 0		%cmp4833314 = icmp eq i8 undef, 0
br i1 %cmp4833314, label %if.end517, label %land.rhs485		br i1 %cmp4833314, label %if.end517, label %land.rhs485

land.rhs485:		land.rhs485:
%incdec.ptr4803316 = phi i8* [ %incdec.ptr480, %do.body479.backedge.land.rhs485_crit_edge ], [ undef, %do.body479.preheader ]		%incdec.ptr4803316 = phi i8* [ %incdec.ptr480, %do.body479.backedge.land.rhs485_crit_edge ], [ undef, %do.body479.preheader ]
%isascii.i.i27763151 = icmp sgt i8 undef, -1		%isascii.i.i27763151 = icmp sgt i8 undef, -1
br i1 %isascii.i.i27763151, label %cond.true.i.i2780, label %cond.false.i.i2782		br i1 %isascii.i.i27763151, label %cond.true.i.i2780, label %cond.false.i.i2782

...

test/Transforms/CodeGenPrepare/AArch64/widen_switch.ll

	...

	return:			return:
	%retval = phi i32 [ -1, %sw.default ], [ 0, %sw.bb0 ], [ 1, %sw.bb1 ]			%retval = phi i32 [ -1, %sw.default ], [ 0, %sw.bb0 ], [ 1, %sw.bb1 ]
	ret i32 %retval			ret i32 %retval

	; ARM64-LABEL: @widen_switch_i16(			; ARM64-LABEL: @widen_switch_i16(
	; ARM64: %0 = zext i16 %trunc to i32			; ARM64: %0 = zext i16 %trunc to i32
	; ARM64-NEXT: switch i32 %0, label %sw.default [			; ARM64-NEXT: switch i32 %0, label %sw.default [
	; ARM64-NEXT: i32 1, label %return			; ARM64-NEXT: i32 1, label %sw.bb0
	; ARM64-NEXT: i32 65535, label %sw.bb1			; ARM64-NEXT: i32 65535, label %sw.bb1
	}			}

	; Widen to 32-bit from a smaller, non-native type.			; Widen to 32-bit from a smaller, non-native type.

	define i32 @widen_switch_i17(i32 %a) {			define i32 @widen_switch_i17(i32 %a) {
	entry:			entry:
	%trunc = trunc i32 %a to i17			%trunc = trunc i32 %a to i17
	...

	return:			return:
	%retval = phi i32 [ -1, %sw.default ], [ 0, %sw.bb0 ], [ 1, %sw.bb1 ]			%retval = phi i32 [ -1, %sw.default ], [ 0, %sw.bb0 ], [ 1, %sw.bb1 ]
	ret i32 %retval			ret i32 %retval

	; ARM64-LABEL: @widen_switch_i17(			; ARM64-LABEL: @widen_switch_i17(
	; ARM64: %0 = zext i17 %trunc to i32			; ARM64: %0 = zext i17 %trunc to i32
	; ARM64-NEXT: switch i32 %0, label %sw.default [			; ARM64-NEXT: switch i32 %0, label %sw.default [
	; ARM64-NEXT: i32 10, label %return			; ARM64-NEXT: i32 10, label %sw.bb0
	; ARM64-NEXT: i32 131071, label %sw.bb1			; ARM64-NEXT: i32 131071, label %sw.bb1
	}			}

	; If the switch condition is a sign-extended function argument, then the			; If the switch condition is a sign-extended function argument, then the
	; condition and cases should be sign-extended rather than zero-extended			; condition and cases should be sign-extended rather than zero-extended
	; because the sign-extension can be optimized away.			; because the sign-extension can be optimized away.

	define i32 @widen_switch_i16_sext(i2 signext %a) {			define i32 @widen_switch_i16_sext(i2 signext %a) {
	...

	return:			return:
	%retval = phi i32 [ -1, %sw.default ], [ 0, %sw.bb0 ], [ 1, %sw.bb1 ]			%retval = phi i32 [ -1, %sw.default ], [ 0, %sw.bb0 ], [ 1, %sw.bb1 ]
	ret i32 %retval			ret i32 %retval

	; ARM64-LABEL: @widen_switch_i16_sext(			; ARM64-LABEL: @widen_switch_i16_sext(
	; ARM64: %0 = sext i2 %a to i32			; ARM64: %0 = sext i2 %a to i32
	; ARM64-NEXT: switch i32 %0, label %sw.default [			; ARM64-NEXT: switch i32 %0, label %sw.default [
	; ARM64-NEXT: i32 1, label %return			; ARM64-NEXT: i32 1, label %sw.bb0
	; ARM64-NEXT: i32 -1, label %sw.bb1			; ARM64-NEXT: i32 -1, label %sw.bb1
	}			}

test/Transforms/CodeGenPrepare/X86/widen_switch.ll

	...

	return:			return:
	%retval = phi i32 [ -1, %sw.default ], [ 0, %sw.bb0 ], [ 1, %sw.bb1 ]			%retval = phi i32 [ -1, %sw.default ], [ 0, %sw.bb0 ], [ 1, %sw.bb1 ]
	ret i32 %retval			ret i32 %retval

	; X86-LABEL: @widen_switch_i16(			; X86-LABEL: @widen_switch_i16(
	; X86: %trunc = trunc i32 %a to i16			; X86: %trunc = trunc i32 %a to i16
	; X86-NEXT: switch i16 %trunc, label %sw.default [			; X86-NEXT: switch i16 %trunc, label %sw.default [
	; X86-NEXT: i16 1, label %return			; X86-NEXT: i16 1, label %sw.bb0
	; X86-NEXT: i16 -1, label %sw.bb1			; X86-NEXT: i16 -1, label %sw.bb1
	}			}

	; Widen to 32-bit from a smaller, non-native type.			; Widen to 32-bit from a smaller, non-native type.

	define i32 @widen_switch_i17(i32 %a) {			define i32 @widen_switch_i17(i32 %a) {
	entry:			entry:
	%trunc = trunc i32 %a to i17			%trunc = trunc i32 %a to i17
	...

	return:			return:
	%retval = phi i32 [ -1, %sw.default ], [ 0, %sw.bb0 ], [ 1, %sw.bb1 ]			%retval = phi i32 [ -1, %sw.default ], [ 0, %sw.bb0 ], [ 1, %sw.bb1 ]
	ret i32 %retval			ret i32 %retval

	; X86-LABEL: @widen_switch_i17(			; X86-LABEL: @widen_switch_i17(
	; X86: %0 = zext i17 %trunc to i32			; X86: %0 = zext i17 %trunc to i32
	; X86-NEXT: switch i32 %0, label %sw.default [			; X86-NEXT: switch i32 %0, label %sw.default [
	; X86-NEXT: i32 10, label %return			; X86-NEXT: i32 10, label %sw.bb0
	; X86-NEXT: i32 131071, label %sw.bb1			; X86-NEXT: i32 131071, label %sw.bb1
	}			}

	; If the switch condition is a sign-extended function argument, then the			; If the switch condition is a sign-extended function argument, then the
	; condition and cases should be sign-extended rather than zero-extended			; condition and cases should be sign-extended rather than zero-extended
	; because the sign-extension can be optimized away.			; because the sign-extension can be optimized away.

	define i32 @widen_switch_i16_sext(i2 signext %a) {			define i32 @widen_switch_i16_sext(i2 signext %a) {
	...

	return:			return:
	%retval = phi i32 [ -1, %sw.default ], [ 0, %sw.bb0 ], [ 1, %sw.bb1 ]			%retval = phi i32 [ -1, %sw.default ], [ 0, %sw.bb0 ], [ 1, %sw.bb1 ]
	ret i32 %retval			ret i32 %retval

	; X86-LABEL: @widen_switch_i16_sext(			; X86-LABEL: @widen_switch_i16_sext(
	; X86: %0 = sext i2 %a to i8			; X86: %0 = sext i2 %a to i8
	; X86-NEXT: switch i8 %0, label %sw.default [			; X86-NEXT: switch i8 %0, label %sw.default [
	; X86-NEXT: i8 1, label %return			; X86-NEXT: i8 1, label %sw.bb0
	; X86-NEXT: i8 -1, label %sw.bb1			; X86-NEXT: i8 -1, label %sw.bb1
	}			}

test/Transforms/CodeGenPrepare/skip-merging-case-block.ll

This file was added.

				; RUN: opt -codegenprepare < %s -mtriple=aarch64-none-linux-gnu -S | FileCheck %s

				target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
				target triple = "aarch64--linux-gnu"

				; Expect to skip merging two empty blocks (sw.bb and sw.bb2) into sw.epilog
				; as both of them are unlikely to be executed.
				define i32 @f_switch(i32 %c) {
				; CHECK-LABEL: @f_switch
				; CHECK-LABEL: entry:
				; CHECK: i32 10, label %sw.bb
				; CHECK: i32 20, label %sw.bb2
				entry:
				switch i32 %c, label %sw.default [
				i32 10, label %sw.bb
				i32 20, label %sw.bb2
				i32 30, label %sw.bb3
				i32 40, label %sw.bb4
				], !prof !0

				sw.bb: ; preds = %entry
				br label %sw.epilog

				sw.bb2: ; preds = %entry
				br label %sw.epilog

				sw.bb3: ; preds = %entry
				call void bitcast (void (...)* @callcase3 to void ()*)()
				br label %sw.epilog

				sw.bb4: ; preds = %entry
				call void bitcast (void (...)* @callcase4 to void ()*)()
				br label %sw.epilog

				sw.default: ; preds = %entry
				call void bitcast (void (...)* @calldefault to void ()*)()
				br label %sw.epilog

				; CHECK-LABEL: sw.epilog:
				; CHECK: %fp.0 = phi void (...)* [ @FD, %sw.default ], [ @F4, %sw.bb4 ], [ @F3, %sw.bb3 ], [ @F2, %sw.bb2 ], [ @F1, %sw.bb ]
				sw.epilog: ; preds = %sw.default, %sw.bb3, %sw.bb2, %sw.bb
				%fp.0 = phi void (...)* [ @FD, %sw.default ], [ @F4, %sw.bb4 ], [ @F3, %sw.bb3 ], [ @F2, %sw.bb2 ], [ @F1, %sw.bb ]
				%callee.knr.cast = bitcast void (...)* %fp.0 to void ()*
				call void %callee.knr.cast()
				ret i32 0
				}

				; Expect not to merge sw.bb2 because of the conflict in the incoming value from
				; sw.bb which is already merged.
				define i32 @f_switch2(i32 %c) {
				; CHECK-LABEL: @f_switch2
				; CHECK-LABEL: entry:
				; CHECK: i32 10, label %sw.epilog
				; CHECK: i32 20, label %sw.bb2
				entry:
				switch i32 %c, label %sw.default [
				i32 10, label %sw.bb
				i32 20, label %sw.bb2
				i32 30, label %sw.bb3
				i32 40, label %sw.bb4
				], !prof !1

				sw.bb: ; preds = %entry
				br label %sw.epilog

				sw.bb2: ; preds = %entry
				br label %sw.epilog

				sw.bb3: ; preds = %entry
				call void bitcast (void (...)* @callcase3 to void ()*)()
				br label %sw.epilog

				sw.bb4: ; preds = %entry
				call void bitcast (void (...)* @callcase4 to void ()*)()
				br label %sw.epilog

				sw.default: ; preds = %entry
				call void bitcast (void (...)* @calldefault to void ()*)()
				br label %sw.epilog

				; CHECK-LABEL: sw.epilog:
				; CHECK: %fp.0 = phi void (...)* [ @FD, %sw.default ], [ @F4, %sw.bb4 ], [ @F3, %sw.bb3 ], [ @F2, %sw.bb2 ], [ @F1, %entry ]
				sw.epilog: ; preds = %sw.default, %sw.bb3, %sw.bb2, %sw.bb
				%fp.0 = phi void (...)* [ @FD, %sw.default ], [ @F4, %sw.bb4 ], [ @F3, %sw.bb3 ], [ @F2, %sw.bb2 ], [ @F1, %sw.bb ]
				%callee.knr.cast = bitcast void (...)* %fp.0 to void ()*
				call void %callee.knr.cast()
				ret i32 0
				}

				; Multiple empty blocks should be considered together if all incoming values
				; from them are the same. We expect to merge both empty blocks (sw.bb and sw.bb2)
				; because the sum of their frequencies is higher than the threshold.
				define i32 @f_switch3(i32 %c) {
				; CHECK-LABEL: @f_switch3
				; CHECK-LABEL: entry:
				; CHECK: i32 10, label %sw.epilog
				; CHECK: i32 20, label %sw.epilog
				entry:
				switch i32 %c, label %sw.default [
				i32 10, label %sw.bb
				i32 20, label %sw.bb2
				i32 30, label %sw.bb3
				i32 40, label %sw.bb4
				], !prof !2

				sw.bb: ; preds = %entry
				br label %sw.epilog

				sw.bb2: ; preds = %entry
				br label %sw.epilog

				sw.bb3: ; preds = %entry
				call void bitcast (void (...)* @callcase3 to void ()*)()
				br label %sw.epilog

				sw.bb4: ; preds = %entry
				call void bitcast (void (...)* @callcase4 to void ()*)()
				br label %sw.epilog

				sw.default: ; preds = %entry
				call void bitcast (void (...)* @calldefault to void ()*)()
				br label %sw.epilog

				; CHECK-LABEL: sw.epilog:
				; CHECK: %fp.0 = phi void (...)* [ @FD, %sw.default ], [ @F4, %sw.bb4 ], [ @F3, %sw.bb3 ], [ @F1, %entry ], [ @F1, %entry ]
				sw.epilog: ; preds = %sw.default, %sw.bb3, %sw.bb2, %sw.bb
				%fp.0 = phi void (...)* [ @FD, %sw.default ], [ @F4, %sw.bb4 ], [ @F3, %sw.bb3 ], [ @F1, %sw.bb2 ], [ @F1, %sw.bb ]
				%callee.knr.cast = bitcast void (...)* %fp.0 to void ()*
				call void %callee.knr.cast()
				ret i32 0
				}

				declare void @F1(...) local_unnamed_addr
				declare void @F2(...) local_unnamed_addr
				declare void @F3(...) local_unnamed_addr
				declare void @F4(...) local_unnamed_addr
				declare void @FD(...) local_unnamed_addr
				declare void @callcase3(...) local_unnamed_addr
				declare void @callcase4(...) local_unnamed_addr
				declare void @calldefault(...) local_unnamed_addr

				!0 = !{!"branch_weights", i32 5, i32 1, i32 1,i32 5, i32 5}
				!1 = !{!"branch_weights", i32 1 , i32 5, i32 1,i32 1, i32 1}
				!2 = !{!"branch_weights", i32 1 , i32 4, i32 1,i32 1, i32 1}
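
As a rough check of the expectations in the three functions of skip-merging-case-block.ll, the sketch below applies the final comparison from isMergingEmptyBlockProfitable to the branch weights in !0, !1 and !2. It rests on several assumptions: block frequencies are taken to be exactly proportional to branch weights (the real BlockFrequencyInfo values are scaled), the ratio is taken to be 2 as the derivation in the code comment suggests, and the helper names isMergeProfitable and totalWeight are hypothetical and not part of the patch.

```cpp
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical stand-in for the last comparison in
// isMergingEmptyBlockProfitable: merging is treated as profitable when
// Freq(Pred) <= Freq(BB) * Ratio. Plain integers replace BlockFrequency.
bool isMergeProfitable(uint64_t predFreq, uint64_t bbFreq, uint64_t ratio) {
  return predFreq <= bbFreq * ratio;
}

// Sum of all branch weights on the switch, used here as an idealized
// frequency for the block that contains the switch (the entry block).
uint64_t totalWeight(const std::vector<uint64_t> &weights) {
  // A switch always carries at least the weight of its default destination.
  assert(!weights.empty() && "expected at least the default weight");
  return std::accumulate(weights.begin(), weights.end(), uint64_t{0});
}
```

Under these assumptions, !0 gives the entry block a weight of 17 against 1 for each of sw.bb and sw.bb2, so neither is merged. With !1 the entry weight is 9: sw.bb (weight 5) passes the check and is merged, while sw.bb2 (weight 1) does not. With !2 the entry weight is 8, and because sw.bb and sw.bb2 both feed @F1 into the PHI their weights are summed to 5, so the check passes even though sw.bb2 alone (weight 1) would fail it.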