This is an archive of the discontinued LLVM Phabricator instance.

[CodeGenPrep] Skip merging empty case blocks
ClosedPublic

Authored by junbuml on Jul 22 2016, 1:45 PM.

Details

Summary

Merging an empty case block into the header block of switch could cause ISel to add COPY instructions in the header of switch, instead of the case block, if the case block is used as an incoming block of a PHI. This could potentially increase dynamic instructions, especially when the switch is in a loop. I added a test case which was reduced from the benchmark I was targetting.

Diff Detail

Repository
rL LLVM

Event Timeline

junbuml updated this revision to Diff 65136.Jul 22 2016, 1:45 PM
junbuml retitled this revision from to [CodeGenPrep] Skip merging empty case blocks.
junbuml updated this object.
junbuml added a subscriber: llvm-commits.
junbuml updated this object.Jul 22 2016, 1:49 PM
rengolin edited edge metadata.Jul 23 2016, 6:43 AM

The change kind of makes sense to me, but I'd rather others with more experience on basic block merging to opine first.

I've added a couple of points inline. It would be good to see this behaviour in other targets, though.

It'd also be good to do a run on the test-suite to make sure there's no regressions (maybe code size?) in at least one target.

cheers,
--renato

lib/CodeGen/CodeGenPrepare.cpp
426 ↗(On Diff #65136)

Why not just continue here?

test/CodeGen/AArch64/aarch64-skip-merging-case-block.ll
8 ↗(On Diff #65136)

Please, put the check lines where the related sources are, down there with the IR.

It's not clear how this lines test anything you implemented above.

22 ↗(On Diff #65136)

This is a very convoluted test. Can't you make it simpler?

From the description in the comments, you just need a switch in a loop, and you don't need any of those structs, or extern functions or anything.

164 ↗(On Diff #65136)

Can you simplify your code to not need those functions? The change comment implies you can...

172 ↗(On Diff #65136)

Do you really need the attributes to make the test exhibit the behaviour you're trying to fix?

junbuml updated this revision to Diff 67840.Aug 12 2016, 8:54 AM
junbuml edited edge metadata.
junbuml marked 5 inline comments as done.

Sorry for the long trip time on this. First, I make the test case very simple as Renato commented. I also make this applied conservatively because a branch instruction could be added in the empty case block. So, this will happen only when the frequency of empty case block is significantly lower than the frequency of header of switch.
Please take a look and let me know any comment.
Thanks,
Jun

lib/CodeGen/CodeGenPrepare.cpp
426 ↗(On Diff #65136)

My intention was to perform continue for the outer for loop, not for the inner while loop.

davidxl added inline comments.Aug 19 2016, 3:30 PM
lib/CodeGen/CodeGenPrepare.cpp
424 ↗(On Diff #67840)

If the unique Predecessor of BB is terminated ..

427 ↗(On Diff #67840)

Do you have actual examples showing the problem of extra copy instruction added? I could not connect the dots here.

444 ↗(On Diff #67840)

How about the size impact (for Os build) ?

junbuml updated this revision to Diff 68888.Aug 22 2016, 11:24 AM
junbuml marked an inline comment as done.

Address comments from David.

junbuml marked an inline comment as done.Aug 22 2016, 11:25 AM
junbuml added inline comments.
lib/CodeGen/CodeGenPrepare.cpp
424 ↗(On Diff #67840)

Thanks !

427 ↗(On Diff #67840)

In my first posting in Diff 1, I added a test case (aarch64-skip-merging-case-block.ll) which was reduced from the benchmark I was targeting. This test case should show copies in the header of switch which is in a loop and make the situation in CGP. I removed this test as it's unnecessary complex to be used as a test case. Please see aarch64-skip-merging-case-block.ll in Diff 1 and let me know if you want me to add the test in this patch.

444 ↗(On Diff #67840)

This could potentially add a branch instruction, so we should do this when OptSize is false.

davidxl edited edge metadata.Aug 26 2016, 12:34 PM

Fixing this in CGP seems like papering over real problems in ISEL or machine function optimization passes. The reason is that such code may be written by the user (not created by CGP): e.g.

switch( ..) {
   case 1:             // no empty block created for this
     break;
   case ..
         ...
  }
junbuml marked an inline comment as done.Aug 26 2016, 1:16 PM

Probably, smarter ISel could handle this issue. I'm not sure if GlobalISel could cover such case when handing phi nodes. Considering that the original purpose of CGP is to better prepare IR for local ISel, I think handling this in CGP is not that bad idea while we rely on the local ISel. Do you have any machine function pass in your mind where we can deal with this issue?

davidxl edited edge metadata.Aug 26 2016, 4:14 PM
davidxl added a subscriber: hfinkel.

Kindly ping. Please let me know any comment on it.

Kindly ping. Is there any comment or suggestion?

We actually see similar issues introduced by SimplifyCFG which introduces new critical edges which later can not be properly split in PhiElimination which results in copy instructions be inserted into blocks with higher frequency. Until we have a better solution, this patch seems like a reasonable workaround.

Hal, what is your opinion?

Could you check if https://reviews.llvm.org/D24818 would fix your issue? I suppose when there are mutiple switch targets, each edge is likely to have <50% taken ratio (even with static profile). Thus I guess all critical edges should be splitted by machine sink. Or, if not, you can simply change 50% to 99% and see if it fixes the problem.

We actually see similar issues introduced by SimplifyCFG which introduces new critical edges which later can not be properly split in PhiElimination which results in copy instructions be inserted into blocks with higher frequency. Until we have a better solution, this patch seems like a reasonable workaround.

Hal, what is your opinion?

I've thought about this only for a few minutes, and I'd definitely like to know if @danielcdh's patch also fixes this, but what is the benefit of creating critical edges here in the first place (i.e. when the case block is likely to be executed)? Shouldn't we decide later whether to speculate instructions (including copies that come from PHIs) later at the MI level (based on instruction costs and profiling data) as an independent set of decisions?

I tried dehao's patch on the original test case -- unfortunately it does not work. The reason is that MachineSink pass fails to split the critical edge because the indirect branch (from switch lowering) in the source BB prevents the edge from being split because it is not analyzable. See MachineBasicBlock::canSplitCriticalEdge

I've thought about this only for a few minutes, and I'd definitely like to know if @danielcdh's patch also fixes this, but what is the benefit of creating critical edges here in the first place (i.e. when the case block is likely to be executed)? Shouldn't we decide later whether to speculate instructions (including copies that come from PHIs) later at the MI level (based on instruction costs and profiling data) as an independent set of decisions?

By skipping merging an empty case block, we could add an extra branch, so I do this only when the frequency of empty case block is significantly low using BFI. Based on the original purpose of CGP to better prepare IR for local ISel, doing this in CGP doesn't seem to be a really bad idea for me. However, handling this independently in MI level could be more reasonable.

I tried dehao's patch on the original test case -- unfortunately it does not work. The reason is that MachineSink pass fails to split the critical edge because the indirect branch (from switch lowering) in the source BB prevents the edge from being split because it is not analyzable. See MachineBasicBlock::canSplitCriticalEdge

Thank you very much David for the test. Yes, I also confirmed that @danielcdh's patch didn't fix the case I was targeting. Let me take a closer look at the MachineSink pass to see if I can come up with a feasible solution there.

Could you point me to the original test case? I tried on the unittest in this patch, my patch can create the critical edge. And looks to me by the time of MachineSink, the switch has already been lowered to branches, thus there is no indirect branch?

Given the limitation of MachineSink pass to split critical edges later (for the switch case), the effect of creating critical edge in CGP here can be quite detrimental, so we should probably make the patch more general -- instead of checking just switch inst in Predecessor, check indirect branch as well. Perhaps also adding some cost related heuristic -- by checking the number of phis (with incoming value from Pred) in the Succ block.

Here is what happens In this particular case:

  BB#7                   <--- indirect branch with jump table
/      \       \

BB4 BB5 EB

/

\ | /

BB2

EB gets eliminated by CGP

BB#7                   <--- indirect branch with jump table
/      \       \

BB4 BB5 |

/

\ | /

BB2

BB#2:

%vreg8<def> = PHI %vreg33, <BB#7>, %vreg19, d<BB#10>, ....
 %vreg9<def> = PHI %vreg34, <BB#7>, %vreg48, <BB#10>, ....
 %vreg10<def> = PHI %vreg35, <BB#7>, %vreg2, <BB#10>, ...
 %vreg11<def> = PHI %vreg36, <BB#7>, %vreg1, <BB#10>,  ....

BB#7:

%vreg36<def> = COPY %vreg38; GPR64all:%vreg36 GPR64:%vreg38
 %vreg35<def> = COPY %vreg39; GPR64all:%vreg35 GPR64:%vreg39
 %vreg33<def> = COPY %vreg40; GPR32all:%vreg33 GPR32:%vreg40
 %vreg34<def> = SUBREG_TO_REG 0, %vreg41, 15; GPR64all:%vreg34 GPR32:%vreg41
 %vreg43<def> = SUBREG_TO_REG 0, %vreg44<kill>, 15; GPR64:%vreg43 GPR32common:%vreg44
 %vreg47<def> = LDRXroX %vreg46, %vreg43<kill>, 0, 1; mem:LD8[JumpTable] GPR64:%vreg47,%vreg43 GPR64common:%vreg46
 BR %vreg47<kill>; GPR64:%vreg47

After PHI elimination, due to the critical edge, BB#7 ends up with many copy instructions:

BB#2:

 %vreg11<def> = COPY %vreg84<kill>; GPR64:%vreg11,%vreg84
%vreg10<def> = COPY %vreg83<kill>; GPR64:%vreg10,%vreg83
%vreg9<def> = COPY %vreg82<kill>; GPR64:%vreg9,%vreg82
%vreg8<def> = COPY %vreg81<kill>; GPR32all:%vreg8,%vreg81

 ....

BB#7:

%vreg81<def> = COPY %vreg33<kill>; GPR32all:%vreg81,%vreg33
   %vreg82<def> = COPY %vreg34<kill>; GPR64:%vreg82 GPR64all:%vreg34
   %vreg83<def> = COPY %vreg35<kill>; GPR64:%vreg83 GPR64all:%vreg35
   %vreg84<def> = COPY %vreg36<kill>; GPR64:%vreg84 GPR64all:%vreg36
   %vreg85<def> = COPY %vreg19; GPR32all:%vreg85 GPR32sp:%vreg19
   %vreg86<def> = COPY %vreg5; GPR64all:%vreg86,%vreg5
   %vreg87<def> = COPY %vreg2; GPR64all:%vreg87,%vreg2
   %vreg88<def> = COPY %vreg1; GPR64all:%vreg88,%vreg1

Some of these copy instructions which can not be sinked into less frequent path (due to failure in critical edge split) later become the address computation ADRP/MOV during virtual register rewrite.

Since there are multiple sinkable instructions in the source BB in this case, MachineSink without Dehao's patch (which handles single instruction case) should work if the edge is actually splitable.

@danielcdh: Could you point me to the original test case?

In my first posting in Diff 1, I added a test case (aarch64-skip-merging-case-block.ll) which was reduced from the benchmark I was targeting.

@davidxl: Given the limitation of MachineSink pass to split critical edges later (for the switch case), the effect of creating critical edge in CGP here can be quite detrimental, so we should probably make the patch more general -- instead of checking just switch inst in Predecessor, check indirect branch as well. Perhaps also adding some cost related heuristic -- by checking the number of phis (with incoming value from Pred) in the Succ block.

Thanks David for the detail. What you describe here is exactly what I observed. Based on your comment, I will update this change to avoid creating critical edges in case when potential sinkables cannot be handled in later pass (MachineSink); it's going to be more general to cover indirect branches as well, and a simple cost heuristic will be added based on the number of phis in Succ.

junbuml updated this revision to Diff 73046.Sep 30 2016, 7:20 AM

Updated the change based on David Li's comments.

davidxl added inline comments.Sep 30 2016, 5:19 PM
lib/CodeGen/CodeGenPrepare.cpp
457 ↗(On Diff #73046)

in the header of switch --> in the predecessor of BB instead of BB (if it is not merged)

471 ↗(On Diff #73046)

Since unique predecessor is checked here, so the PredBB's frequency is always no less than BB. Because of this, why don't skip the Frequency check (basically using ratio 1:1)?

davidxl added inline comments.Sep 30 2016, 7:41 PM
lib/CodeGen/CodeGenPrepare.cpp
471 ↗(On Diff #73046)

On second thought, considering the cost of a direct branch, it is probably better to set the default frequency ratio to be >=2 . Also the check of number of phis should probably a 'OR' instead of 'AND'. The default value of MinNumPhiInDestToSkipMerge should be 2:

if (Freq(Pred) >= FreqRatio*Freq(BB) || NumCopyInsertionPHIs > MinNumPhis)
   return true;

return false;
rengolin resigned from this revision.Oct 4 2016, 5:29 AM
rengolin removed a reviewer: rengolin.

On second thought, considering the cost of a direct branch, it is probably better to set the default frequency ratio to be >=2 . Also the check of number of phis should >probably a 'OR' instead of 'AND'. The default value of MinNumPhiInDestToSkipMerge should be 2:
if (Freq(Pred) >= FreqRatio*Freq(BB) || NumCopyInsertionPHIs > MinNumPhis)

return true;

Thanks David for the review. I ran spec2000/2006 with the change you suggested. Most of the score changes seems to be noise, but I can see -2% reproducible regression in spec2006/povray when we are aggressive in this (i.e., the default frequency ratio to be >=2). I can see more I-cache miss with "FreqRatio=2" in spec2006/povray. It seems that it skips merging empty cases even for a switch with small number of cases and the cost of branch is non-trivial especially when it mess up the I-cache.

I think we need to be more conservative for the frequency ratio so that we skip merging only when we are sure that the empty case is less frequently executed.

As of now, with/without this change (with FreqRatio=2), I can see 6.7% more L1 I-cache miss from perf-stat. I didn’t try to narrow down to the place in which the regression was caused, but I printed out FreqRatio, NumCopyInsertionPHIs, and NumOfCases when this change applied and I can see this was applied even when the number of cases in switch is small enough due to the aggressive FreqRatio=2. Please let me know if you want me to further narrow down.

FreqRatio: 160 , NumCopyInsertionPHIs : 1, NumCase: 5
FreqRatio: 102 , NumCopyInsertionPHIs : 1, NumCase: 3
FreqRatio: 786 , NumCopyInsertionPHIs : 16, NumCase: 793
FreqRatio: 64 , NumCopyInsertionPHIs : 1, NumCase: 2
FreqRatio: 128 , NumCopyInsertionPHIs : 1, NumCase: 4
FreqRatio: 128 , NumCopyInsertionPHIs : 0, NumCase: 4
FreqRatio: 128 , NumCopyInsertionPHIs : 0, NumCase: 4
FreqRatio: 128 , NumCopyInsertionPHIs : 0, NumCase: 4
FreqRatio: 128 , NumCopyInsertionPHIs : 0, NumCase: 4
FreqRatio: 128 , NumCopyInsertionPHIs : 0, NumCase: 4
FreqRatio: 128 , NumCopyInsertionPHIs : 0, NumCase: 4
FreqRatio: 128 , NumCopyInsertionPHIs : 0, NumCase: 4
FreqRatio: 128 , NumCopyInsertionPHIs : 0, NumCase: 4
FreqRatio: 8 , NumCopyInsertionPHIs : 5, NumCase: 7
FreqRatio: 8 , NumCopyInsertionPHIs : 5, NumCase: 7
FreqRatio: 8 , NumCopyInsertionPHIs : 5, NumCase: 7
FreqRatio: 8 , NumCopyInsertionPHIs : 5, NumCase: 7
FreqRatio: 8 , NumCopyInsertionPHIs : 5, NumCase: 7
FreqRatio: 8 , NumCopyInsertionPHIs : 5, NumCase: 7
FreqRatio: 8 , NumCopyInsertionPHIs : 5, NumCase: 7
FreqRatio: 8 , NumCopyInsertionPHIs : 5, NumCase: 7
FreqRatio: 8 , NumCopyInsertionPHIs : 5, NumCase: 7
FreqRatio: 8 , NumCopyInsertionPHIs : 5, NumCase: 7
FreqRatio: 32 , NumCopyInsertionPHIs : 1, NumCase: 7
FreqRatio: 32 , NumCopyInsertionPHIs : 1, NumCase: 8
FreqRatio: 32 , NumCopyInsertionPHIs : 3, NumCase: 4
FreqRatio: 96 , NumCopyInsertionPHIs : 0, NumCase: 3
FreqRatio: 64 , NumCopyInsertionPHIs : 0, NumCase: 2
FreqRatio: 39 , NumCopyInsertionPHIs : 0, NumCase: 32
FreqRatio: 1024 , NumCopyInsertionPHIs : 0, NumCase: 32
FreqRatio: 1024 , NumCopyInsertionPHIs : 0, NumCase: 32
FreqRatio: 39 , NumCopyInsertionPHIs : 1, NumCase: 32
FreqRatio: 864 , NumCopyInsertionPHIs : 0, NumCase: 27
FreqRatio: 32 , NumCopyInsertionPHIs : 0, NumCase: 2
FreqRatio: 64 , NumCopyInsertionPHIs : 1, NumCase: 3
FreqRatio: 64 , NumCopyInsertionPHIs : 1, NumCase: 3
FreqRatio: 64 , NumCopyInsertionPHIs : 1, NumCase: 3
FreqRatio: 64 , NumCopyInsertionPHIs : 1, NumCase: 3
FreqRatio: 193 , NumCopyInsertionPHIs : 1, NumCase: 6

can you isolate one small case and show the final code with/without the change?

Let me take a small switch applied by this change from povray .

I can see a couple of issues:

  1. the cost of direct branch can be modelled better
  2. static profile can get it quite wrong if the branch prediction heuristic is wrong. Do you see similar regression with PGO? I can see without PGO, we should indeed make the frequency ratio to be much larger (i.e., require more cases in switch).

I agree that we may need to improve the static heuristic especially for switch.
I haven't tried PGO for this change. I can run povray with PGO.
Yes, I believe we should be conservative enough in the frequency ratio in CGP so that the extra branch should be added only when we are sure that the empty case is certainly less frequently executed.

  • In PGO I can see just noise level regression (-0.5%) with this patch with FreqRatio=2.
  • In Evaluate_TPat(), I can see 10 empty case blocks were applied in this patch, resulting in 10 more branches in the final assembly in aarch64 with -mcpu=kryo. Since SimplifyCFG also merge empty blocks in TryToSimplifyUncondBranchFromEmptyBlock(), making a simple case hit in CGP require quite complex testcase. That's why I removed my first testcase in Diff1 (aarch64-skip-merging-case-block.ll). Please let me know if you really need a reduce testcase.
  • As of now, in this patch I believe we need to keep the default FreqRatio conservative, so I want to set it 1000, which I believe conservative enough.

Does it also enable more machine code sinking? If not, perhaps the heuristics to count copy insertion phis can be improved?

The initial purpose of this patch was to handle the cases which cannot be handled in MachineSink by letting ISel insert COPY in better place. By avoiding creating critical edge which is non-splittable in MachineSink, I think it could also indirectly increase the change of machine sinking. Since more number of copy insertion PHIs means more chances to sink in later pass, integrating the number of PHIs in the heuristics seems to be reasonable to me. However, I think the minimum frequency check should come first because we don't want to leave extra branch in a likely executed block regardless of the number of PHIs. To integrate NumCopyInsertionPHIs and minimum frequency requirement, I thought about the heuristics something like :

CurFreqRatio = Freq(PredBB) / Freq(BB);
if (CurFreqRatio > MinThresholdToSkipMerge && (NumCopyInsertionPHIs * CurFreqRatio) > MinNumPhiInDestToSkipMerge * MinThresholdToSkipMerge) {

// skip merging current empty case

}

Please let me know your thought.

How do you come up with 1000? It seems like specially tuned for that benchmark where the switch top is in a loop while the switch body is not -- the patch basically will never kick in for any switch in normal form without profile.

In case of the benchmark I was targeting, it has a huge switch in a loop and the switch have many empty cases only used as incoming blocks in PHIs. So, merging those empty cases cause many COPYs in the header of switch and MachineSink cannot sink them because it cannot split the critical edges across the jump table. Since the switch is huge, the ratio of FreqOfSwitchHeader to FreqOfEmptyCase was pretty large, more than 3000. In this patch I wanted to hit only such clear cases where the frequency of case is certainly lower than the switch header because the cost of extra branch added by skipping merging empty cases is sometime non-trivial especially when the case is likely to be taken as we observed in the povray case.

junbuml updated this revision to Diff 74284.Oct 11 2016, 2:28 PM

Integrated David's comments. Thanks David for your review with the detail comments.

Basically, I took the heuristic you suggested Freq(Pred) / Freq(BB) > 2. Please find the detail from the comment in the code.

Using the latest truck, ran performance test for spec2000/2006 and I didn't see regression even in spec2006.povray. The function applied by this patch is not even hot in my profile so the previous regression I observed must be caused by difference code alignment.

Regarding inspecting PHIs, for me, it seems not that easy to predict if a PHI ends up with a COPY in CodeGenPrepare, as COPYs might be added when performing deSSA and removed in later passes (e.g., Register Coalescing or Machine Copy Propagation). In this cost heuristic in CGP, I believe it's not unreasonable to see a PHI as one potential COPY. In the worst case where none of PHIs results in a COPY, the empty BB which is skipped here might end up with only one branch instruction, so it will be removed in BranchFold pass.

Regarding the multiple empty block case, when there are multiple empty blocks which are used as incoming blocks in the same PHI, we may be able to merge only one of them in most case. That is because of the conflict in incoming values in the PHI if one of them is already merged. I modified the testcase to introduce two empty blocks. In the first test, f_switch(), both two empty blocks are skipped as both of them are unlikely executed. In the second test, f_switch2(), once the first empty block (sw.bb) is merged, the second block(sw.bb2) cannot be merged because of the conflict in incoming value from sw.bb which is already merged. If all the incoming values are the same from the multiple empty blocks in the same PHI, then it should be either all merged or skipped, but I doubt if we can see such case in CGP in general. Please let me know if you were mentioning different cases in your previous comment.

Thanks,
Jun

junbuml updated this revision to Diff 75209.Oct 19 2016, 1:19 PM

Integrated David's comments by adding support to handle multiple empty blocks sharing the same incoming value for the PHIs in the DestBB. Added f_switch3() in the testcase, which will describe the case of multiple empty blocks. Please take a look and let me know any comments.

Thanks,
Jun

davidxl added inline comments.Nov 7 2016, 10:52 AM
lib/CodeGen/CodeGenPrepare.cpp
383 ↗(On Diff #75209)

Nit: If --> Of

466 ↗(On Diff #75209)

Use early return to reduce nesting level.

487 ↗(On Diff #75209)

Find all other incoming blocks from which incoming values of all PHIs in DestBB are the same as the ones from BB.

junbuml updated this revision to Diff 77220.Nov 8 2016, 11:33 AM
junbuml marked an inline comment as done.

Addressed comments from David Li and added fixes in test case failures in :

ragreedy-hoist-spill.ll
AArch64/widen_switch.ll
X86/widen_switch.ll  
phi-immediate-factoring.ll
junbuml added a reviewer: wmi.Nov 8 2016, 11:52 AM
junbuml marked 3 inline comments as done.

Hi Manman / Wei,

This change made a change in ragreedy-hoist-spill.ll which was modified in 8dbd561a and 815b02e9. Can you please review if the change in ragreedy-hoist-spill.ll is still make sense?
Thanks,
Jun

manmanren added inline comments.Nov 9 2016, 11:41 AM
test/CodeGen/X86/ragreedy-hoist-spill.ll
183 ↗(On Diff #77220)

The whole purpose of this testing case is to make sure that a spill is not hoisted to a hotter outer loop.

As long as that is still true with your change, it is fine.

junbuml marked an inline comment as done.Nov 9 2016, 12:31 PM
junbuml added inline comments.
test/CodeGen/X86/ragreedy-hoist-spill.ll
183 ↗(On Diff #77220)

Thanks Manman.
Yes, this change should not change such behavior. In this test, the spill is hoisted in sw.bb474, which is not the hotter outer loop; still even colder than the inner loop.

junbuml marked an inline comment as done.Nov 18 2016, 6:59 AM

Kindly ping?

Do you know how often BPI/BFI is actually needed? In other words, among all functions compiled, how many of them reach to the point that BFI is needed? If you have some stats that will be great. If the data shows not often -- then BPI/BFI may need to be computed in a lazy way.

Do you know how often BPI/BFI is actually needed? In other words, among all functions compiled, how many of them reach to the point that BFI is needed? If you have some stats that will be great. If the data shows not often -- then BPI/BFI may need to be computed in a lazy way.

I'm quite sure that BPI/BFI would be rarely used as we specifically look for empty blocks which has unique predecessor terminated by SwitchInst or IndirectBrInst. We can compute BPI/BFI in isMergingEmptyBlockProfitable() and then cache it only for the current function. Please let me know if you see any down side of it?

junbuml updated this revision to Diff 78630.Nov 19 2016, 9:30 AM

Address David's comment. Now we compute BFI/BPI only when it need to be.

davidxl accepted this revision.Nov 19 2016, 12:14 PM
davidxl edited edge metadata.

lgtm

This revision is now accepted and ready to land.Nov 19 2016, 12:14 PM
This revision was automatically updated to reflect the committed changes.
joerg added a subscriber: joerg.Nov 28 2016, 11:07 AM

Reverted with r288052.

junbuml updated this revision to Diff 79746.Nov 30 2016, 8:37 AM
junbuml edited edge metadata.
junbuml removed rL LLVM as the repository for this revision.

This change was reverted in r288052 due to the invalid loop info after eliminating an empty block. Now, I create LoopInfo when creating BFI/BPI so that it's not impacted by previous empty block eliminated.

junbuml reopened this revision.Nov 30 2016, 8:38 AM
This revision is now accepted and ready to land.Nov 30 2016, 8:38 AM
junbuml requested a review of this revision.Nov 30 2016, 8:39 AM
junbuml edited edge metadata.
joerg edited edge metadata.Nov 30 2016, 2:12 PM

Can you include the failing test case just to make sure no future changes will trigger this again?

junbuml updated this revision to Diff 79920.Dec 1 2016, 8:43 AM
junbuml edited edge metadata.

Addressed Joerg's comment.

Kindly ping?

Kindly ping one more time.

This change was reverted due to the invalid loop info after eliminating an empty block. The only change I made is that I create a LoopInfo when creating BFI/BPI so that it's not impacted by previous empty block eliminated.

davidxl accepted this revision.Dec 15 2016, 10:01 AM
davidxl edited edge metadata.

lgtm

This revision is now accepted and ready to land.Dec 15 2016, 10:01 AM
This revision was automatically updated to reflect the committed changes.