Merging an empty case block into the header block of switch could cause ISel to add COPY instructions in the header of switch, instead of the case block, if the case block is used as an incoming block of a PHI. This could potentially increase dynamic instructions, especially when the switch is in a loop. I added a test case which was reduced from the benchmark I was targetting.
Details
- Reviewers
t.p.northover wmi davidxl joerg manmanren mcrosier - Commits
- rG90b6b5074af4: [CodeGenPrep] Skip merging empty case blocks
rG85347dde27db: [CodeGenPrep] Skip merging empty case blocks
rG82f55c54460a: [CodeGenPrep] Skip merging empty case blocks
rL289988: [CodeGenPrep] Skip merging empty case blocks
rL289951: [CodeGenPrep] Skip merging empty case blocks
rL287553: [CodeGenPrep] Skip merging empty case blocks
Diff Detail
- Repository
- rL LLVM
Event Timeline
The change kind of makes sense to me, but I'd rather others with more experience on basic block merging to opine first.
I've added a couple of points inline. It would be good to see this behaviour in other targets, though.
It'd also be good to do a run on the test-suite to make sure there's no regressions (maybe code size?) in at least one target.
cheers,
--renato
lib/CodeGen/CodeGenPrepare.cpp | ||
---|---|---|
426 ↗ | (On Diff #65136) | Why not just continue here? |
test/CodeGen/AArch64/aarch64-skip-merging-case-block.ll | ||
8 ↗ | (On Diff #65136) | Please, put the check lines where the related sources are, down there with the IR. It's not clear how this lines test anything you implemented above. |
22 ↗ | (On Diff #65136) | This is a very convoluted test. Can't you make it simpler? From the description in the comments, you just need a switch in a loop, and you don't need any of those structs, or extern functions or anything. |
164 ↗ | (On Diff #65136) | Can you simplify your code to not need those functions? The change comment implies you can... |
172 ↗ | (On Diff #65136) | Do you really need the attributes to make the test exhibit the behaviour you're trying to fix? |
Sorry for the long trip time on this. First, I make the test case very simple as Renato commented. I also make this applied conservatively because a branch instruction could be added in the empty case block. So, this will happen only when the frequency of empty case block is significantly lower than the frequency of header of switch.
Please take a look and let me know any comment.
Thanks,
Jun
lib/CodeGen/CodeGenPrepare.cpp | ||
---|---|---|
426 ↗ | (On Diff #65136) | My intention was to perform continue for the outer for loop, not for the inner while loop. |
lib/CodeGen/CodeGenPrepare.cpp | ||
---|---|---|
424 ↗ | (On Diff #67840) | If the unique Predecessor of BB is terminated .. |
427 ↗ | (On Diff #67840) | Do you have actual examples showing the problem of extra copy instruction added? I could not connect the dots here. |
444 ↗ | (On Diff #67840) | How about the size impact (for Os build) ? |
lib/CodeGen/CodeGenPrepare.cpp | ||
---|---|---|
424 ↗ | (On Diff #67840) | Thanks ! |
427 ↗ | (On Diff #67840) | In my first posting in Diff 1, I added a test case (aarch64-skip-merging-case-block.ll) which was reduced from the benchmark I was targeting. This test case should show copies in the header of switch which is in a loop and make the situation in CGP. I removed this test as it's unnecessary complex to be used as a test case. Please see aarch64-skip-merging-case-block.ll in Diff 1 and let me know if you want me to add the test in this patch. |
444 ↗ | (On Diff #67840) | This could potentially add a branch instruction, so we should do this when OptSize is false. |
Fixing this in CGP seems like papering over real problems in ISEL or machine function optimization passes. The reason is that such code may be written by the user (not created by CGP): e.g.
switch( ..) { case 1: // no empty block created for this break; case .. ... }
Probably, smarter ISel could handle this issue. I'm not sure if GlobalISel could cover such case when handing phi nodes. Considering that the original purpose of CGP is to better prepare IR for local ISel, I think handling this in CGP is not that bad idea while we rely on the local ISel. Do you have any machine function pass in your mind where we can deal with this issue?
We actually see similar issues introduced by SimplifyCFG which introduces new critical edges which later can not be properly split in PhiElimination which results in copy instructions be inserted into blocks with higher frequency. Until we have a better solution, this patch seems like a reasonable workaround.
Hal, what is your opinion?
Could you check if https://reviews.llvm.org/D24818 would fix your issue? I suppose when there are mutiple switch targets, each edge is likely to have <50% taken ratio (even with static profile). Thus I guess all critical edges should be splitted by machine sink. Or, if not, you can simply change 50% to 99% and see if it fixes the problem.
I've thought about this only for a few minutes, and I'd definitely like to know if @danielcdh's patch also fixes this, but what is the benefit of creating critical edges here in the first place (i.e. when the case block is likely to be executed)? Shouldn't we decide later whether to speculate instructions (including copies that come from PHIs) later at the MI level (based on instruction costs and profiling data) as an independent set of decisions?
I tried dehao's patch on the original test case -- unfortunately it does not work. The reason is that MachineSink pass fails to split the critical edge because the indirect branch (from switch lowering) in the source BB prevents the edge from being split because it is not analyzable. See MachineBasicBlock::canSplitCriticalEdge
I've thought about this only for a few minutes, and I'd definitely like to know if @danielcdh's patch also fixes this, but what is the benefit of creating critical edges here in the first place (i.e. when the case block is likely to be executed)? Shouldn't we decide later whether to speculate instructions (including copies that come from PHIs) later at the MI level (based on instruction costs and profiling data) as an independent set of decisions?
By skipping merging an empty case block, we could add an extra branch, so I do this only when the frequency of empty case block is significantly low using BFI. Based on the original purpose of CGP to better prepare IR for local ISel, doing this in CGP doesn't seem to be a really bad idea for me. However, handling this independently in MI level could be more reasonable.
I tried dehao's patch on the original test case -- unfortunately it does not work. The reason is that MachineSink pass fails to split the critical edge because the indirect branch (from switch lowering) in the source BB prevents the edge from being split because it is not analyzable. See MachineBasicBlock::canSplitCriticalEdge
Thank you very much David for the test. Yes, I also confirmed that @danielcdh's patch didn't fix the case I was targeting. Let me take a closer look at the MachineSink pass to see if I can come up with a feasible solution there.
Could you point me to the original test case? I tried on the unittest in this patch, my patch can create the critical edge. And looks to me by the time of MachineSink, the switch has already been lowered to branches, thus there is no indirect branch?
Given the limitation of MachineSink pass to split critical edges later (for the switch case), the effect of creating critical edge in CGP here can be quite detrimental, so we should probably make the patch more general -- instead of checking just switch inst in Predecessor, check indirect branch as well. Perhaps also adding some cost related heuristic -- by checking the number of phis (with incoming value from Pred) in the Succ block.
Here is what happens In this particular case:
BB#7 <--- indirect branch with jump table / \ \
BB4 BB5 EB
/ | |
\ | /
BB2
EB gets eliminated by CGP
BB#7 <--- indirect branch with jump table / \ \
BB4 BB5 |
/ | |
\ | /
BB2
BB#2:
%vreg8<def> = PHI %vreg33, <BB#7>, %vreg19, d<BB#10>, .... %vreg9<def> = PHI %vreg34, <BB#7>, %vreg48, <BB#10>, .... %vreg10<def> = PHI %vreg35, <BB#7>, %vreg2, <BB#10>, ... %vreg11<def> = PHI %vreg36, <BB#7>, %vreg1, <BB#10>, ....
BB#7:
%vreg36<def> = COPY %vreg38; GPR64all:%vreg36 GPR64:%vreg38 %vreg35<def> = COPY %vreg39; GPR64all:%vreg35 GPR64:%vreg39 %vreg33<def> = COPY %vreg40; GPR32all:%vreg33 GPR32:%vreg40 %vreg34<def> = SUBREG_TO_REG 0, %vreg41, 15; GPR64all:%vreg34 GPR32:%vreg41 %vreg43<def> = SUBREG_TO_REG 0, %vreg44<kill>, 15; GPR64:%vreg43 GPR32common:%vreg44 %vreg47<def> = LDRXroX %vreg46, %vreg43<kill>, 0, 1; mem:LD8[JumpTable] GPR64:%vreg47,%vreg43 GPR64common:%vreg46 BR %vreg47<kill>; GPR64:%vreg47
After PHI elimination, due to the critical edge, BB#7 ends up with many copy instructions:
BB#2:
%vreg11<def> = COPY %vreg84<kill>; GPR64:%vreg11,%vreg84 %vreg10<def> = COPY %vreg83<kill>; GPR64:%vreg10,%vreg83 %vreg9<def> = COPY %vreg82<kill>; GPR64:%vreg9,%vreg82 %vreg8<def> = COPY %vreg81<kill>; GPR32all:%vreg8,%vreg81 ....
BB#7:
%vreg81<def> = COPY %vreg33<kill>; GPR32all:%vreg81,%vreg33 %vreg82<def> = COPY %vreg34<kill>; GPR64:%vreg82 GPR64all:%vreg34 %vreg83<def> = COPY %vreg35<kill>; GPR64:%vreg83 GPR64all:%vreg35 %vreg84<def> = COPY %vreg36<kill>; GPR64:%vreg84 GPR64all:%vreg36 %vreg85<def> = COPY %vreg19; GPR32all:%vreg85 GPR32sp:%vreg19 %vreg86<def> = COPY %vreg5; GPR64all:%vreg86,%vreg5 %vreg87<def> = COPY %vreg2; GPR64all:%vreg87,%vreg2 %vreg88<def> = COPY %vreg1; GPR64all:%vreg88,%vreg1
Some of these copy instructions which can not be sinked into less frequent path (due to failure in critical edge split) later become the address computation ADRP/MOV during virtual register rewrite.
Since there are multiple sinkable instructions in the source BB in this case, MachineSink without Dehao's patch (which handles single instruction case) should work if the edge is actually splitable.
@danielcdh: Could you point me to the original test case?
In my first posting in Diff 1, I added a test case (aarch64-skip-merging-case-block.ll) which was reduced from the benchmark I was targeting.
@davidxl: Given the limitation of MachineSink pass to split critical edges later (for the switch case), the effect of creating critical edge in CGP here can be quite detrimental, so we should probably make the patch more general -- instead of checking just switch inst in Predecessor, check indirect branch as well. Perhaps also adding some cost related heuristic -- by checking the number of phis (with incoming value from Pred) in the Succ block.
Thanks David for the detail. What you describe here is exactly what I observed. Based on your comment, I will update this change to avoid creating critical edges in case when potential sinkables cannot be handled in later pass (MachineSink); it's going to be more general to cover indirect branches as well, and a simple cost heuristic will be added based on the number of phis in Succ.
lib/CodeGen/CodeGenPrepare.cpp | ||
---|---|---|
457 ↗ | (On Diff #73046) | in the header of switch --> in the predecessor of BB instead of BB (if it is not merged) |
471 ↗ | (On Diff #73046) | Since unique predecessor is checked here, so the PredBB's frequency is always no less than BB. Because of this, why don't skip the Frequency check (basically using ratio 1:1)? |
lib/CodeGen/CodeGenPrepare.cpp | ||
---|---|---|
471 ↗ | (On Diff #73046) | On second thought, considering the cost of a direct branch, it is probably better to set the default frequency ratio to be >=2 . Also the check of number of phis should probably a 'OR' instead of 'AND'. The default value of MinNumPhiInDestToSkipMerge should be 2: if (Freq(Pred) >= FreqRatio*Freq(BB) || NumCopyInsertionPHIs > MinNumPhis) return true; return false; |
On second thought, considering the cost of a direct branch, it is probably better to set the default frequency ratio to be >=2 . Also the check of number of phis should >probably a 'OR' instead of 'AND'. The default value of MinNumPhiInDestToSkipMerge should be 2:
if (Freq(Pred) >= FreqRatio*Freq(BB) || NumCopyInsertionPHIs > MinNumPhis)return true;
Thanks David for the review. I ran spec2000/2006 with the change you suggested. Most of the score changes seems to be noise, but I can see -2% reproducible regression in spec2006/povray when we are aggressive in this (i.e., the default frequency ratio to be >=2). I can see more I-cache miss with "FreqRatio=2" in spec2006/povray. It seems that it skips merging empty cases even for a switch with small number of cases and the cost of branch is non-trivial especially when it mess up the I-cache.
I think we need to be more conservative for the frequency ratio so that we skip merging only when we are sure that the empty case is less frequently executed.
As of now, with/without this change (with FreqRatio=2), I can see 6.7% more L1 I-cache miss from perf-stat. I didn’t try to narrow down to the place in which the regression was caused, but I printed out FreqRatio, NumCopyInsertionPHIs, and NumOfCases when this change applied and I can see this was applied even when the number of cases in switch is small enough due to the aggressive FreqRatio=2. Please let me know if you want me to further narrow down.
FreqRatio: 160 , NumCopyInsertionPHIs : 1, NumCase: 5
FreqRatio: 102 , NumCopyInsertionPHIs : 1, NumCase: 3
FreqRatio: 786 , NumCopyInsertionPHIs : 16, NumCase: 793
FreqRatio: 64 , NumCopyInsertionPHIs : 1, NumCase: 2
FreqRatio: 128 , NumCopyInsertionPHIs : 1, NumCase: 4
FreqRatio: 128 , NumCopyInsertionPHIs : 0, NumCase: 4
FreqRatio: 128 , NumCopyInsertionPHIs : 0, NumCase: 4
FreqRatio: 128 , NumCopyInsertionPHIs : 0, NumCase: 4
FreqRatio: 128 , NumCopyInsertionPHIs : 0, NumCase: 4
FreqRatio: 128 , NumCopyInsertionPHIs : 0, NumCase: 4
FreqRatio: 128 , NumCopyInsertionPHIs : 0, NumCase: 4
FreqRatio: 128 , NumCopyInsertionPHIs : 0, NumCase: 4
FreqRatio: 128 , NumCopyInsertionPHIs : 0, NumCase: 4
FreqRatio: 8 , NumCopyInsertionPHIs : 5, NumCase: 7
FreqRatio: 8 , NumCopyInsertionPHIs : 5, NumCase: 7
FreqRatio: 8 , NumCopyInsertionPHIs : 5, NumCase: 7
FreqRatio: 8 , NumCopyInsertionPHIs : 5, NumCase: 7
FreqRatio: 8 , NumCopyInsertionPHIs : 5, NumCase: 7
FreqRatio: 8 , NumCopyInsertionPHIs : 5, NumCase: 7
FreqRatio: 8 , NumCopyInsertionPHIs : 5, NumCase: 7
FreqRatio: 8 , NumCopyInsertionPHIs : 5, NumCase: 7
FreqRatio: 8 , NumCopyInsertionPHIs : 5, NumCase: 7
FreqRatio: 8 , NumCopyInsertionPHIs : 5, NumCase: 7
FreqRatio: 32 , NumCopyInsertionPHIs : 1, NumCase: 7
FreqRatio: 32 , NumCopyInsertionPHIs : 1, NumCase: 8
FreqRatio: 32 , NumCopyInsertionPHIs : 3, NumCase: 4
FreqRatio: 96 , NumCopyInsertionPHIs : 0, NumCase: 3
FreqRatio: 64 , NumCopyInsertionPHIs : 0, NumCase: 2
FreqRatio: 39 , NumCopyInsertionPHIs : 0, NumCase: 32
FreqRatio: 1024 , NumCopyInsertionPHIs : 0, NumCase: 32
FreqRatio: 1024 , NumCopyInsertionPHIs : 0, NumCase: 32
FreqRatio: 39 , NumCopyInsertionPHIs : 1, NumCase: 32
FreqRatio: 864 , NumCopyInsertionPHIs : 0, NumCase: 27
FreqRatio: 32 , NumCopyInsertionPHIs : 0, NumCase: 2
FreqRatio: 64 , NumCopyInsertionPHIs : 1, NumCase: 3
FreqRatio: 64 , NumCopyInsertionPHIs : 1, NumCase: 3
FreqRatio: 64 , NumCopyInsertionPHIs : 1, NumCase: 3
FreqRatio: 64 , NumCopyInsertionPHIs : 1, NumCase: 3
FreqRatio: 193 , NumCopyInsertionPHIs : 1, NumCase: 6
can you isolate one small case and show the final code with/without the change?
Let me take a small switch applied by this change from povray .
I can see a couple of issues:
- the cost of direct branch can be modelled better
- static profile can get it quite wrong if the branch prediction heuristic is wrong. Do you see similar regression with PGO? I can see without PGO, we should indeed make the frequency ratio to be much larger (i.e., require more cases in switch).
I agree that we may need to improve the static heuristic especially for switch.
I haven't tried PGO for this change. I can run povray with PGO.
Yes, I believe we should be conservative enough in the frequency ratio in CGP so that the extra branch should be added only when we are sure that the empty case is certainly less frequently executed.
- In PGO I can see just noise level regression (-0.5%) with this patch with FreqRatio=2.
- In Evaluate_TPat(), I can see 10 empty case blocks were applied in this patch, resulting in 10 more branches in the final assembly in aarch64 with -mcpu=kryo. Since SimplifyCFG also merge empty blocks in TryToSimplifyUncondBranchFromEmptyBlock(), making a simple case hit in CGP require quite complex testcase. That's why I removed my first testcase in Diff1 (aarch64-skip-merging-case-block.ll). Please let me know if you really need a reduce testcase.
- As of now, in this patch I believe we need to keep the default FreqRatio conservative, so I want to set it 1000, which I believe conservative enough.
Does it also enable more machine code sinking? If not, perhaps the heuristics to count copy insertion phis can be improved?
The initial purpose of this patch was to handle the cases which cannot be handled in MachineSink by letting ISel insert COPY in better place. By avoiding creating critical edge which is non-splittable in MachineSink, I think it could also indirectly increase the change of machine sinking. Since more number of copy insertion PHIs means more chances to sink in later pass, integrating the number of PHIs in the heuristics seems to be reasonable to me. However, I think the minimum frequency check should come first because we don't want to leave extra branch in a likely executed block regardless of the number of PHIs. To integrate NumCopyInsertionPHIs and minimum frequency requirement, I thought about the heuristics something like :
CurFreqRatio = Freq(PredBB) / Freq(BB);
if (CurFreqRatio > MinThresholdToSkipMerge && (NumCopyInsertionPHIs * CurFreqRatio) > MinNumPhiInDestToSkipMerge * MinThresholdToSkipMerge) {
// skip merging current empty case
}
Please let me know your thought.
How do you come up with 1000? It seems like specially tuned for that benchmark where the switch top is in a loop while the switch body is not -- the patch basically will never kick in for any switch in normal form without profile.
In case of the benchmark I was targeting, it has a huge switch in a loop and the switch have many empty cases only used as incoming blocks in PHIs. So, merging those empty cases cause many COPYs in the header of switch and MachineSink cannot sink them because it cannot split the critical edges across the jump table. Since the switch is huge, the ratio of FreqOfSwitchHeader to FreqOfEmptyCase was pretty large, more than 3000. In this patch I wanted to hit only such clear cases where the frequency of case is certainly lower than the switch header because the cost of extra branch added by skipping merging empty cases is sometime non-trivial especially when the case is likely to be taken as we observed in the povray case.
Integrated David's comments. Thanks David for your review with the detail comments.
Basically, I took the heuristic you suggested Freq(Pred) / Freq(BB) > 2. Please find the detail from the comment in the code.
Using the latest truck, ran performance test for spec2000/2006 and I didn't see regression even in spec2006.povray. The function applied by this patch is not even hot in my profile so the previous regression I observed must be caused by difference code alignment.
Regarding inspecting PHIs, for me, it seems not that easy to predict if a PHI ends up with a COPY in CodeGenPrepare, as COPYs might be added when performing deSSA and removed in later passes (e.g., Register Coalescing or Machine Copy Propagation). In this cost heuristic in CGP, I believe it's not unreasonable to see a PHI as one potential COPY. In the worst case where none of PHIs results in a COPY, the empty BB which is skipped here might end up with only one branch instruction, so it will be removed in BranchFold pass.
Regarding the multiple empty block case, when there are multiple empty blocks which are used as incoming blocks in the same PHI, we may be able to merge only one of them in most case. That is because of the conflict in incoming values in the PHI if one of them is already merged. I modified the testcase to introduce two empty blocks. In the first test, f_switch(), both two empty blocks are skipped as both of them are unlikely executed. In the second test, f_switch2(), once the first empty block (sw.bb) is merged, the second block(sw.bb2) cannot be merged because of the conflict in incoming value from sw.bb which is already merged. If all the incoming values are the same from the multiple empty blocks in the same PHI, then it should be either all merged or skipped, but I doubt if we can see such case in CGP in general. Please let me know if you were mentioning different cases in your previous comment.
Thanks,
Jun
Integrated David's comments by adding support to handle multiple empty blocks sharing the same incoming value for the PHIs in the DestBB. Added f_switch3() in the testcase, which will describe the case of multiple empty blocks. Please take a look and let me know any comments.
Thanks,
Jun
Addressed comments from David Li and added fixes in test case failures in :
ragreedy-hoist-spill.ll AArch64/widen_switch.ll X86/widen_switch.ll phi-immediate-factoring.ll
Hi Manman / Wei,
This change made a change in ragreedy-hoist-spill.ll which was modified in 8dbd561a and 815b02e9. Can you please review if the change in ragreedy-hoist-spill.ll is still make sense?
Thanks,
Jun
test/CodeGen/X86/ragreedy-hoist-spill.ll | ||
---|---|---|
183 ↗ | (On Diff #77220) | The whole purpose of this testing case is to make sure that a spill is not hoisted to a hotter outer loop. As long as that is still true with your change, it is fine. |
test/CodeGen/X86/ragreedy-hoist-spill.ll | ||
---|---|---|
183 ↗ | (On Diff #77220) | Thanks Manman. |
Do you know how often BPI/BFI is actually needed? In other words, among all functions compiled, how many of them reach to the point that BFI is needed? If you have some stats that will be great. If the data shows not often -- then BPI/BFI may need to be computed in a lazy way.
Do you know how often BPI/BFI is actually needed? In other words, among all functions compiled, how many of them reach to the point that BFI is needed? If you have some stats that will be great. If the data shows not often -- then BPI/BFI may need to be computed in a lazy way.
I'm quite sure that BPI/BFI would be rarely used as we specifically look for empty blocks which has unique predecessor terminated by SwitchInst or IndirectBrInst. We can compute BPI/BFI in isMergingEmptyBlockProfitable() and then cache it only for the current function. Please let me know if you see any down side of it?
This change was reverted in r288052 due to the invalid loop info after eliminating an empty block. Now, I create LoopInfo when creating BFI/BPI so that it's not impacted by previous empty block eliminated.
Can you include the failing test case just to make sure no future changes will trigger this again?
Kindly ping one more time.
This change was reverted due to the invalid loop info after eliminating an empty block. The only change I made is that I create a LoopInfo when creating BFI/BPI so that it's not impacted by previous empty block eliminated.