This is an archive of the discontinued LLVM Phabricator instance.

[Greedy Regalloc] Reg splitting based on loops
Needs ReviewPublic

Authored by wmi on Feb 12 2016, 9:40 PM.

Details

Reviewers
qcolombet
Summary

The patch is to add reg splitting based on loops, especially hot loops. It is a supplement to current region splitting.

During analysis of PR26597, PR24278 and PR24348, we find some typical senarios which cannot be well handled by current region splitting.

  • Vreg1 has reference inside loop1 and lives through loop2. Vreg2 is both defined and used inside loop2. Vreg1 gets a physical register before Vreg2's allocation. Loop2 has high register pressure, so vreg2 cannot find available physical register. Vreg1 has higher weight than vreg2 because it is used in loop1 (The hotness of loop1 and loop2 are comparable), so vreg2 cannot evict vreg1 and can only be spilled. However, it is better to split vreg1 around loop2. If vreg1 is splitted around loop2, the vreg1' generated around loop2 will only have define and use at the loop preheader and exit block (cold compared with loop body), which means it has low weight and can be evicted by vreg2. After vreg2 evicts vreg1', vreg2 will get physical register and there is less spill in loop2. Current region splitting will not try splitting vreg1 when allocating for vreg2 -- it will only try splitting vreg2. PR26597 belongs to this case.
  • Both vreg1 and vreg2 have references inside a loop. The reference of vreg2 in the loop can be rematerialized but vreg1 cannot, which indicates vreg2 should have lower weight than vreg1. But in reality vreg2 still has higher weight than vreg1 because it has shorter live range than vreg1, which is another factor affecting live range weight. After vreg2 gets physical register, vreg1 cannot evict vreg2 because it has lower weight. vreg1 cannot be rematerialized so there will be a spill inserted in the loop. For this case, it is also better to split vreg1 around loop, so vreg1' generated by split around loop will have higher weight than vreg2 because it is shorter. vreg1' will evict vreg2 and vreg2 reference in loop will be rematerialized. There will be less spill in loop2. Current region splitting will also split vreg1, but the splitting point is inside the loop. That is because vreg1 has reference inside loop not interfering with a physical register. In order to let vreg1 references inside loop be allocated to physical register as many as possible, region based spliitting choose to split inside loop. Although the splitted vreg1' inside loop evict vreg2 and get physical register in the end, there are extra reg movs left inside the loop. PR24278 belongs to this case.
  • Vreg1 has only 1 reference in loop1 but vreg2 has 4 reference in loop1. vreg1 also has many reference in loop2 so overall vreg1 has higher weight than vreg2. Loop1 has high register pressure and only one physical register is left for vreg1 and vreg2. Finally vreg1 get the physical register because it has higher weight. However, it is unoptimal for loop1 because vreg2 actually has more reference in loop1. For this case, it is better to split both vreg1 and vreg2 around loop1, so the weight of vreg1' and vreg2' around the loop after split can be computed all based on the references inside loop1, and then vreg2' will have higher weight than vreg1' and it will get the physical register. Current region splitting only tries to find holes from the live ranges of physical register being used. It will not try the splitting here. This is an assumed case. I didn't try to catch such testcase from testsuites but I think it probably exists.

From the cases analysis above, to optimize register allocation for a hot loop with high register pressure to the best, we may want to split all the virtregs which live across the loop at the loop boundary. In such way, the hot loop will become an independent regalloc region isolated from regalloc outside the region. The patch here implements such splitting.

The logic to implement splitting around loop has two major steps. The first step is: If a vreg lives across a hot loop and it has real reference inside the loop, we will only split the vreg around the loop. If the new vreg around loop generated after splitting still cannot get physical register, we will try the second step - to split all the vregs living across the loop once in all. The intuition of the first step is, if the new vreg around loop after splitting still cannot get physical register, it tells us definitively the loop has high register pressure, and only then we will try the second step.

Another part necessary is to split critical edges for loop entry and exit so that the splitting can happen at the loop boundary. A separate pass is added to do that. I find it is also helpful for phi-elimination and shrinkwrapping.

Performance on x86-64:
0.5% improvement on avg for google internal benchmarks.
Some improvements on llvm testsuite, no apparent regression saw.

SingleSource/Benchmarks/Shootout/matrix: 10.54%
SingleSource/Benchmarks/Adobe-C++/stepanov_vector: 3.41%
MultiSource/Benchmarks/sim/sim: 7.31%
MultiSource/Applications/siod/siod: 9.59%
(The above four are improved because of reg splitting around loop)
SingleSource/Benchmarks/Shootout-C++/ackermann: 21.3%
SingleSource/Benchmarks/BenchmarkGame/recursive: 11.02%
(These two are improved because splitting critical edges enables more shrinkwrapping)
MultiSource/Benchmarks/ASC_Sequoia/CrystalMk/CrystalMk: 1.73%
(This testcase is improved because after splitting critical edges, phi-elimination can generate a copy outside loop)

Compile time:
There is 1.12% compile time increase for spec2006 all C/C++ benchmarks. For one file I looked at, the increased compile is majorly spent on propagateSiblingValues. When I use -split-spill-mode=size for both base and comparison, the compile time increase drops to 0.8%. I will work on it and try to reduce the compile time increase more.

Problems to solve - Better hints handling.
Loop splitting introduces more vreg copies at the loop boundary. It depends on register allocation to coalesce such copies using hints. However, current hint setting and using in regalloc is not optimal.
A typical case we found is like this: vreg2 and vreg3 are splitted from vreg1. vreg1 has a hint of R8 so vreg2 and vreg3 both have hint of R8. However vreg2 finally gets R10 assigned and then vreg3 also gets R10 because its hint changes from R8 to R10. At this time, the hint of vreg2 is still R8. Depriving R10 from vreg2 will not be regarded as a BrokenHint, which leads to easy deprivation of R10 from vreg2. The copy between vreg2 and vreg3 cannot be removed.
The case we saw was at the loop boundary in a cold outer loop so it didn't cause perf regression, but it is possible that it happened elsewhere I didn't notice.
The hint handling problem exposed here is a general issue and may worth some effort to improve.

Diff Detail

Repository
rL LLVM

Event Timeline

wmi updated this revision to Diff 47896.Feb 12 2016, 9:40 PM
wmi retitled this revision from to [Greedy Regalloc] Reg splitting based on loops.
wmi updated this object.
wmi added a reviewer: qcolombet.
wmi set the repository for this revision to rL LLVM.
wmi added subscribers: llvm-commits, davidxl.
qcolombet edited edge metadata.Feb 15 2016, 10:50 AM

Hi Wei,

Thanks for looking into this.
I haven't looked at the details of the patch, but here are a few comments:

  • The PHIElimination pass already has an option to SplitAllCriticalEdges. Please check how we can use this instead of introducing a new pass.
  • The split around regions should already consider loops. Why is this not doing the right thing? Could you work toward improving the heuristic there instead of introducing another splitting scheme?

Cheers,
-Quentin

wmi added a comment.Feb 15 2016, 12:12 PM

Thanks for taking a look at it.

  • The PHIElimination pass already has an option to SplitAllCriticalEdges. Please check how we can use this instead of introducing a new pass.

Thanks, I will check it.

  • The split around regions should already consider loops. Why is this not doing the right thing? Could you work toward improving the heuristic there instead of introducing another splitting scheme?

The split around regions split current vreg into sub vregs to fit holes unused by physical registers. It considers loop only in order not to put the splitting point inside loop when it is unnecessary. The motivation of splitting is to fill holes.

For the first case listed above, it is required to split the current vreg around loop so the sub vreg used inside loop can evict other vregs occupying physical register but having no use inside loop. The splitting there is to better evict other unimportant vregs, so the motivation is notably different with region split.

As for where to split, current region splitting algorithm uses Hopfield Networks to decide it. loop splitting doesn't need that because where to split is determined -- .i.e, at the loop boundary.

Another difference is current region splitting is limited to split current vreg while the loop splitting proposed here can split multiple vregs at the same time, no matter those vregs are in the queue or already being assigned physical registers.

These are the reasons I feel it is not easy to fit the loop splitting requirement into current region splitting algorithm. However, logic to decide when and how to do loop splitting is relative easy, plus existing splitting transformation kit and live interval update are reused, so there is not too much new stuff added in the patch here actually.

Thanks,
Wei.

Thanks Wei for the additional inputs.
That’ll help me when I get to the review.
That being said, the fact that we consider several live-ranges when doing the splitting worries me complexity-wise. I’ll see if my concerns are real or not.

Cheers,
-Quentin

wmi updated this revision to Diff 48236.Feb 17 2016, 2:01 PM
wmi edited edge metadata.

LoopBase<BlockT, LoopT>::getExitBlocks needs to iterate all the blocks inside the loop everytime it is called. Because getExitBlocks gets called many times in loop based reg splitting, it is better to cache the ExitBlocks for every loop (Used a map LoopExitBlocks in RAGreedy).

With this change, the compile time increase drops from 0.8% to 0.43% when -split-spill-mode=size is on.

jonpa added a subscriber: jonpa.May 23 2017, 10:16 AM
lkail added a subscriber: lkail.Apr 19 2022, 1:20 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 19 2022, 1:20 AM