This is an archive of the discontinued LLVM Phabricator instance.

[RA] Split a virtual register in cold blocks if it is not assigned preferred physical register
ClosedPublic

Authored by Carrot on Jul 27 2023, 2:40 PM.

Details

Summary

If a virtual register is not assigned its preferred physical register, some COPY instructions will be turned into real register move instructions. In this case we can try to split the virtual register in colder blocks; if successful, the original COPY instructions can be deleted, and the new COPY instructions in the colder blocks will be generated as register move instructions. This results in fewer dynamic register move instructions executed.

The new test case split-reg-with-hint.ll gives an example: without this patch the hot path contains 24 instructions; with this patch it contains only 4.

Diff Detail

Event Timeline

Carrot created this revision.Jul 27 2023, 2:40 PM
Herald added a reviewer: MaskRay. · View Herald Transcript
Herald added a project: Restricted Project. · View Herald Transcript
Carrot requested review of this revision.Jul 27 2023, 2:40 PM
Herald added a project: Restricted Project. · View Herald TranscriptJul 27 2023, 2:40 PM

Interesting! This is showing some neat improvements enabling more shrink-wrapping in the test-cases.

Though I suspect changes of the splitting algorithm could trigger regressions (given it's all heuristics). So I would feel better if this change was backed up by more statistics. (Something like compiling llvm-test-suite and collecting regalloc stats or remarks?) and/or report any other benchmarks you did?

arsenm added a subscriber: arsenm.Jul 27 2023, 2:52 PM
arsenm added inline comments.
llvm/test/CodeGen/AMDGPU/ran-out-of-sgprs-allocation-failure.mir
105

This is a huge improvement (though this heuristic is probably not the reason it should have split)

MatzeB added a reviewer: wmi.Jul 27 2023, 2:55 PM
MatzeB added inline comments.Jul 27 2023, 3:04 PM
llvm/test/DebugInfo/X86/live-debug-values.ll
31–32

uhoh...

Interesting! This is showing some neat improvements enabling more shrink-wrapping in the test-cases.

Though I suspect changes of the splitting algorithm could trigger regressions (given it's all heuristics). So I would feel better if this change was backed up by more statistics. (Something like compiling llvm-test-suite and collecting regalloc stats or remarks?) and/or report any other benchmarks you did?

Regressions are indeed observed, e.g. in ARM/fptosi-sat-scalar.ll.
Internal analysis shows it has a big impact on tcmalloc and protobuf parsing. I will run it on SPEC INT.

llvm/test/CodeGen/AMDGPU/ran-out-of-sgprs-allocation-failure.mir
105

Thank you for the confirmation; I really don't know what it is doing.

llvm/test/DebugInfo/X86/live-debug-values.ll
31–32

I guess the debug information is not maintained well when live range splitting occurs.

Carrot added inline comments.Jul 27 2023, 3:29 PM
llvm/test/CodeGen/AMDGPU/ran-out-of-sgprs-allocation-failure.mir
105

I really don't know what it is doing.

lkail added a subscriber: lkail.Jul 27 2023, 10:18 PM
Carrot updated this revision to Diff 547287.Aug 4 2023, 11:30 AM

Improve the cost computation when the VirtReg is assigned a non-hint physical register; this fixed several regressions.

qcolombet requested changes to this revision.Aug 8 2023, 4:26 AM
qcolombet added inline comments.
llvm/lib/CodeGen/RegAllocGreedy.cpp
2404

Splitting decisions are supposed to be taken later (trySplit), this breaks the general algorithm's flow.

This revision now requires changes to proceed.Aug 8 2023, 4:26 AM
qcolombet added inline comments.Aug 8 2023, 5:16 AM
llvm/lib/CodeGen/RegAllocGreedy.cpp
2404

Concretely, I'm asking if you could do this kind of splitting in the splitting heuristic.
You may want to add a new stage in the algorithm (and that'll come at a compile time cost.)

Carrot added inline comments.Aug 8 2023, 10:33 AM
llvm/lib/CodeGen/RegAllocGreedy.cpp
2404

The problem is that we need to decide between a physical register and a split. If we move this to trySplit, we miss the possibility of assigning a physical register to the whole virtual register. trySplit compares splits against different physical registers; here we need to compare a split against a non-split.
Function tryAssignCSRFirstTime does a similar thing. trySplitAroundHintReg is even more special because it splits against only one physical register.

Interesting! This is showing some neat improvements enabling more shrink-wrapping in the test-cases.

Though I suspect changes of the splitting algorithm could trigger regressions (given it's all heuristics). So I would feel better if this change was backed up by more statistics. (Something like compiling llvm-test-suite and collecting regalloc stats or remarks?) and/or report any other benchmarks you did?

Following are the regalloc stats when compiling SPECINT2006 with FDO:

        stats                                            old       new
Number of copies inserted for splitting                103020    121524
Number of splits finished                               19248     24346
Number of split global live ranges                      11068     16305
Number of new live ranges queued                       157359    172528
Number of rematerialized defs for splitting              2756      5536
Number of splits that were simple                        8605     12145

Number of registers assigned                           649170    663979
Number of instructions deleted by DCE                   83051     84678
Number of interferences evicted                         21432     21320
Number of folded stack accesses                         10992     10914
Number of folded loads                                    623       624
Number of live ranges fractured by DCE                    406       390
Number of identity moves eliminated after rewriting    232947    240607
Number of dead lane conflicts tested                     7451      7503
Number of dead lane conflicts resolved                   3943      3957
Number of split local live ranges                        2280      2230
Number of instructions rematerialized                  122540    126312
Number of instructions re-materialized                  93828     94520
Number of reloads inserted                              50031     51344
Number of reloads removed                                 827       787
Number of rematerialized defs for spilling              25833     26237
Number of shrinkToUses called                           94489     95202
Number of spilled snippets                               1083      1100
Number of spill slots allocated                         15944     16053
Number of spilled live ranges                           28650     28776
Number of spills inserted                               22428     22448
Number of spills removed                                  878       778
Number of registers unassigned                          26406     26478
Number of instruction commuting performed                 640       650
Number of cross class joins performed                  159091    161155
Number of copies extended                                  15        14
Number of interval joins performed                     508329    506998
Number of register classes inflated                         4         4
Number of single use loads folded after DCE                32        32

The first 6 stats are much larger than the old values; these are direct results of the new splitting. The other stats are not significantly impacted.

The runtime performance of SPECINT2006 on my desktop is

without this patch
400.perlbench                                    9770        227       43.1 *
401.bzip2                                        9650        370       26.1 *
403.gcc                                          8050        194       41.6 *
429.mcf                                          9120        198       46.1 *
445.gobmk                                       10490        342       30.7 *
456.hmmer                                        9330        246       37.9 *
458.sjeng                                       12100        372       32.5 *
462.libquantum                                  20720        297       69.9 *
464.h264ref                                     22130        328       67.5 *
471.omnetpp                                      6250        233       26.8 *
473.astar                                        7020        294       23.9 *
483.xalancbmk                                    6900        159       43.4 *
 Est. SPECint(R)_base2006           Not Run
 Est. SPECint2006                                                      38.5


with this patch
400.perlbench                                    9770        227       43.0 *
401.bzip2                                        9650        373       25.8 *
403.gcc                                          8050        191       42.1 *
429.mcf                                          9120        200       45.6 *
445.gobmk                                       10490        346       30.3 *
456.hmmer                                        9330        249       37.5 *
458.sjeng                                       12100        376       32.2 *
462.libquantum                                  20720        310       66.8 *
464.h264ref                                     22130        324       68.2 *
471.omnetpp                                      6250        232       27.0 *
473.astar                                        7020        280       25.1 *
483.xalancbmk                                    6900        158       43.7 *
 Est. SPECint(R)_base2006           Not Run
 Est. SPECint2006                                                      38.5

The final scores are the same, but 462.libquantum and 473.astar show big differences. I double-checked them.

462.libquantum
It has a big variation from run to run. Because this patch mainly impacts the dynamic number of move instructions, I checked the executed instruction counts for the two versions; they are 2314346841712 vs 2314346834848, basically no difference. I also checked the hottest 6 functions; they are the same in both versions. So there is no regression in 462.libquantum.

473.astar
This benchmark has a very stable run time, and the performance difference is consistently reproduced. I also checked the dynamic instruction numbers; the new version executes fewer instructions. So this improvement is real.

input                    old             new
BigLakes2048.cfg    334603128498    322906773354
rivers.cfg          663093493368    653682859787

Ping.
473.astar is improved by 5%.

qcolombet added inline comments.Sep 13 2023, 8:50 AM
llvm/lib/CodeGen/RegAllocGreedy.cpp
2404

Function tryAssignCSRFirstTime does a similar thing.

True, but I feel it is justified there because these physregs are not free to use the first time.
I don't see this argument applying here.

if we move it to trySplit, we will miss the possibility of assigning a physical register to the whole virtual register.

Hmm, I don't get that part, we're going to split in trySplitAroundHintReg, how do we get to assign the same physreg to the whole vreg?

we need to compare split against non-split.

I don't get that part either.
For a given vreg, the first time we get through trySplit, we compare non-split against split.

I'm not totally opposed to the patch, but I still fail to see why it doesn't fit in the regular splitting mechanism and hence why we need a special mechanism for this.

Carrot added inline comments.Sep 13 2023, 10:52 AM
llvm/lib/CodeGen/RegAllocGreedy.cpp
2404

Function tryAssignCSRFirstTime does a similar thing.

True, but I feel it is justified there because these physregs are not free to use the first time.
I don't see this argument applying here.

The key point there is that we need to compare the cost of using a CSR (for the first time) against the cost of splitting the virtual register. Then we can decide whether to split the virtual register or assign a single physical register to it.

if we move it to trySplit, we will miss the possibility of assigning a physical register to the whole virtual register.

Hmm, I don't get that part, we're going to split in trySplitAroundHintReg, how do we get to assign the same physreg to the whole vreg?

We already have a physical register before calling trySplitAroundHintReg. If trySplitAroundHintReg can't find a split better than the non-hint physical register, it returns false, and the caller tryAssign can simply return that physical register for the whole vreg.

In trySplitAroundHintReg, we first compute the cost of using a non-hint physical register for the whole vreg, then pass that cost to calculateRegionSplitCostAroundReg. calculateRegionSplitCostAroundReg only records split candidates with a lower cost, so trySplitAroundHintReg effectively compares using a non-hint physical register for the whole vreg against splitting the vreg around the hint register.

we need to compare split against non-split.

I don't get that part either.
For a given vreg, the first time we get through trySplit, we compare non-split against split.

Sorry for the confusion about "non-split". I mean comparing a split against using a single physical register.

I'm not totally opposed to the patch, but I still fail to see why it doesn't fit in the regular splitting mechanism and hence why we need a special mechanism for this.

qcolombet accepted this revision.Sep 13 2023, 11:58 PM

Thanks for the clarifications @Carrot

llvm/lib/CodeGen/RegAllocGreedy.cpp
2404

Got it. Makes sense.

Now, I understand the comment around trySplitAroundHintReg.

This revision is now accepted and ready to land.Sep 13 2023, 11:58 PM
This revision was landed with ongoing or failed builds.Sep 15 2023, 12:55 PM
This revision was automatically updated to reflect the committed changes.
skan added a subscriber: skan.Sep 15 2023, 11:49 PM
skan added a subscriber: wxiao3.Sep 15 2023, 11:56 PM
skan added inline comments.
llvm/test/CodeGen/X86/ragreedy-hoist-spill.ll
78

@wxiao3 @pengfei It seems to be a regression for x86.

Carrot added inline comments.Sep 16 2023, 12:55 AM
llvm/test/CodeGen/X86/ragreedy-hoist-spill.ll
78

BB6 has a higher frequency than BB7. If the branch in BB6 has a 50% taken probability, the final numbers of executed instructions are the same.

fhahn added a subscriber: fhahn.Sep 22 2023, 7:16 AM

It looks like this is causing a ~10% performance regression on some of our workloads on ARM64 macOS. Working on a reproducer.

This patch also seems to cause near-infinite compile times in some cases (I don't know if they actually finish, as the process gets killed via OOM). Trying to get a reproducer now.

@Carrot created an issue with the hang reproducer: https://github.com/llvm/llvm-project/issues/67188

Thanks for the report!
While I'm investigating, you can use -split-threshold-for-reg-with-hint=0 to unblock your work.

anna added a subscriber: anna.EditedSep 22 2023, 1:09 PM

@Carrot created an issue with the hang reproducer: https://github.com/llvm/llvm-project/issues/67188

Thanks for the report!
While I'm investigating, you can use -split-threshold-for-reg-with-hint=0 to unblock your work.

Hi! If it takes too long, can you please check in that change or revert this one, since we have a broken ToT at the moment?

hans added a subscriber: hans.Oct 3 2023, 1:38 AM

We're seeing binary size increases in Chromium, in particular for Android and Fuchsia where size is critical: https://crbug.com/1488374
Is that an inherent property of this change, and could it be scaled back for optsize functions for example?

Carrot added a comment.Oct 3 2023, 8:32 AM

We're seeing binary size increases in Chromium, in particular for Android and Fuchsia where size is critical: https://crbug.com/1488374
Is that an inherent property of this change, and could it be scaled back for optsize functions for example?

This change is driven by the costs of using a single physical register versus splitting a register. Like most cost computations in RA, these are based on the weighted number of instructions (the dynamic number of instructions, i.e. performance); the static number of instructions is not considered. So its impact on code size is random. You can use -split-threshold-for-reg-with-hint=0 to disable this optimization.

hans added a comment.Oct 4 2023, 3:54 AM

We're seeing binary size increases in Chromium, in particular for Android and Fuchsia where size is critical: https://crbug.com/1488374
Is that an inherent property of this change, and could it be scaled back for optsize functions for example?

This change is driven by the costs of using a single physical register versus splitting a register. Like most cost computations in RA, these are based on the weighted number of instructions (the dynamic number of instructions, i.e. performance); the static number of instructions is not considered. So its impact on code size is random. You can use -split-threshold-for-reg-with-hint=0 to disable this optimization.

It does consistently increase binary size (where we measured, which is Android and Fuchsia), so I wouldn't call it random. We are using -split-threshold-for-reg-with-hint=0 to disable this for now, but we really don't want to use internal compiler flags, and I don't think we can expect users to do so in general.

Did you look at binary size in your benchmarking? If we can't make it size neutral, I think we should consider turning it off in optsize functions.

We're seeing binary size increases in Chromium, in particular for Android and Fuchsia where size is critical: https://crbug.com/1488374
Is that an inherent property of this change, and could it be scaled back for optsize functions for example?

This change is driven by the costs of using a single physical register versus splitting a register. Like most cost computations in RA, these are based on the weighted number of instructions (the dynamic number of instructions, i.e. performance); the static number of instructions is not considered. So its impact on code size is random. You can use -split-threshold-for-reg-with-hint=0 to disable this optimization.

It does consistently increase binary size (where we measured, which is Android and Fuchsia), so I wouldn't call it random. We are using -split-threshold-for-reg-with-hint=0 to disable this for now, but we really don't want to use internal compiler flags, and I don't think we can expect users to do so in general.

Did you look at binary size in your benchmarking? If we can't make it size neutral, I think we should consider turning it off in optsize functions.

I thought about it more. There are two typical cases that can change size.

  1. One COPY instruction between a virtual register and a physical register in a hot block constructs a hint for the virtual register. If they interfere in two or more cold blocks, we can split the virtual register in those cold blocks, which increases code size.
  2. Two or more COPY instructions between the same virtual register and physical register in hot blocks construct a hint for the virtual register. If they interfere in only one cold block, we can split the virtual register in that cold block, which decreases code size.

Case 1 is more common than case 2, so the patch is more likely to increase size. It is therefore reasonable to disable it for optsize functions; I will send a patch for that.