Page MenuHomePhabricator

[Greedy regalloc] Replace analyzeSiblingValues with something new
ClosedPublic

Authored by wmi on Dec 7 2015, 1:51 PM.

Details

Summary

The change is to solve PR17409 (https://llvm.org/bugs/show_bug.cgi?id=17409) and its several duplicates. The change is divided into three parts for easier review.

The major issue of analyzeSiblingValues is when a virtreg is splitted to N siblings with the same original VNI, and there are some PHIDefs generated in its splitting, the VNInfo of every sibling will be added to the Dependents of all other siblings, which creates a NxN network. traceSiblingValue is propagating SibValue info through this NxN network so it has NxN time complexity. In addition, selectOrSplit is called for all the N siblings sequentially. When reg pressure is high, a large percentage of siblings will be spilled (let's suppose N/2 siblings will be spilled), and traceSiblingValue will be called N/2 times indirectly from selectOrSplit, then there will be N^3 time complexity in total.

analyzeSiblingValues has two major usages: One is to figure out the SibValueInfo::SpillVNI of the virtReg to be spill so the spill can be hoisted to the place after SpillVNI->def and redundent spills are eliminated at the same time. The other is to trace the sibling copies back to the original value so the computation of the original value can be used for rematerialization. We replace analyzeSiblingValues by reimplementing these functionalities in Part1 and Part2.

Part1:
Instead of figure out the place to hoist spill for each virtReg to be spilled, we do that all at once when allocatePhysRegs is done. With all spills in place, we group spills with the same values (having the same OrigVNI). For each group of the spills with equal values, first we remove redundent spills dominated by other spill in the group, then traverse the dominate tree in post-order and hoist the spills to less frequent dominate tree node. Since spill can be hoisted to a cold dominate tree node without any sibling's VNI->def in the node, it can be better than the original implementation.

I didn't follow Jakub's proposal in pr17409 to change hoistCopiesForSize because redundent backcopies seen in one round of splitting is limited. Suppose Vreg1 is splitted to Vreg2, Vreg3 and Vreg4 in the first round of splitting, and Vreg4 is further splitted to Vreg5 and Vreg6 in the second round of splitting, the redundent backcopies between Vreg2=Vreg3 and Vreg5=Vreg6 cannot be found (I caught more than 100 such cases in llvm testsuites which left redundent spills in the final asm code)

Part2:
To find out the computation of the original value for rematerialization, we always query the inst at OrigVNI->def. To handle the case that the inst at OrigVNI->def has been removed during rematerialization, we change rematerialization to not delete the inst at OrigVNI->def even if it is already dead. In stead, we change the dest vreg to a new vreg (The new vreg will not be reg allocated so it will not affect the allocation of other vregs), save the inst to a set named as DeadRemats, and shrink the original dest vreg in the same way as previous. The insts in DeadRemats will be removed after allocatePhysRegs is done.

Part3:
The Part3 of the change is to clean up the code related with analyzeSiblingValues.

Test with all three parts combined on x86_64-linux-gnu:

  1. The compile time for 1.c in pr24618 dropped from 0.34s to 0.25s The compile time for interpreter_goto_table_impl.ii in pr24618 dropped from 176.80s to 66.86s. I cannot verify the patch using tests in pr17409 because most bugs related with asan/ubsan have been workaround on sanitizer side.
  1. llvm testsuites. Perf: mostly neutral except one perf regression I havn't addressed: SingleSource/Benchmarks/Misc/mandel. The reason has been understood. It is because we didn't do local spill hoist which the original implementation did. The usage of the local hoist is described in the comment in propagateSiblingValue (Starting with // This is an alternative def earlier in the same MBB.) The local hoist cannot be done after allocatePhysRegs so I will address it in a separate patch. CC time: MultiSource/Applications/sqlite3/sqlite3 has 1.5% improvement steadily.
  1. I havn't cleaned up the llvm unit tests because I expect there will be many changes to the patches during the review. I will clean them up later.

Diff Detail

Repository
rL LLVM

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
wmi added inline comments.Mar 21 2016, 9:46 AM
include/llvm/CodeGen/LiveRangeEdit.h
123

Fixed.

148

I Added comments to explain it.
In short, we don't want to allocate phys register for the dummy register used as temporary dst register of instruction in DeadRemats set.

184

You are right. I don't have to save OrigVNI in struct Remat. Instead, add a parameter VNInfo *OrigVNI for LiveRangeEdit::canRematerializeAt.

221

Fixed.

235–239

Fixed.

lib/CodeGen/InlineSpiller.cpp
79

Fixed

81

Fixed

82

Most of the cases the size of it is less than 16 I guess, so I use SmallSetVector instead. I cannot use SmallSet because the set needs to be iterated.

85

Fixed.

90

Fixed.

213

Make HoistSpillHelper a field in inline spiller.

223

I Added postOptimization as a pure virtual func in class Spiller, and put the code of hoist spiller helper inside of InlineSpiller::postOptimization.

998

Fixed.

998

Fixed.

1010

Yes, when turn on split-mode-size, after hoistCopies, it is possible that several siblings live at the same time. it is not just therotically possible but realistic.

I use SmallSetVector instead here.

1024

Thanks for your example comment. It is good. I copy most of them to the code.

Should we consider to moving some of the arguments to field of the current hoistspill instance?

hoistSpillHelper now has the same lifetime as InlineSpiller instance, so the lifes of those arguments are much shorter than that. That is why I choose to keep them as func local objects.

1024

After separate part of the work into HoistSpillHelper::rmRedundentSpills and add more comments in the func body, it may make the code easier to read now.

1043

I separate the first part of the work to another func: HoistSpillHelper::rmRedundentSpills, so HoistSpiller::getVisitOrders is more focus on what its name describes.

1051

Done.

1053

Done.

1065

Done.

1078

Done.

1098

Done.

1120

Yes, that is right.

1135

Fixed.

1137

Fixed.

1174

I keep it along side the subtree in SpillsInSubTreeMap.

1181

Ok, I will do it in a follow-up patch.

1208

Fixed.

1210

Fixed.

1218

Done.

1234

I simply remove the inner loop. SlotToOrigReg map will become somewhat bigger, but not a lot.

1242

Fixed.

lib/CodeGen/RegAllocGreedy.cpp
401

Done.

2571

Done.

2597

Done.

lib/CodeGen/RegAllocPBQP.cpp
130

InlineSpiller is shared by all kinds of register allocator, so the DeadRemats logic is also needed by PBQP. If I separate the part out, I need to fix the related unit test.

731

My original comment that DeadRemats is non-empty only when the regalloc is Greedy is wrong. Actually, InlineSpiller and related Remat logic are shared by all register allocators.

And RegAllocPBQP is not a subclass of RegAllocBase, so the code is needed.

lib/CodeGen/RegisterCoalescer.cpp
463

Fixed.

lib/CodeGen/Spiller.h
38

I don't get the intention to add a bool here. Is it used to guard post optimization? why it is needed?

46

Done.

lib/CodeGen/SplitKit.cpp
727

Comment removed.

749

Fixed.

1134

Added.

lib/CodeGen/SplitKit.h
333

Fixed.

335

Fixed.

338

Fixed.

test/CodeGen/AMDGPU/vgpr-spill-emergency-stack-slot.ll
24 ↗(On Diff #51175)

This is change confusing to me, because if we only use 254 VGPRs then there shouldn't be any spills, but there are still spill instructions being emitted. It seems like this is probably a bug, but I will need to look at it more closely to see if it is an AMDGPU bug or a generic regalloc bug.

wmi added a comment.Mar 21 2016, 11:06 AM

I noticed that even without my change, although compiler output "GCN:
NumVgprs is 256", when I looked at the trace of -debug-only=regalloc,
I found there were some VGPR unused.

Here is what I did:
~/workarea/llvm-r262808/dbuild/./bin/llc -march=amdgcn -mcpu=tahiti
-mattr=+vgpr-spilling -verify-machineinstrs <
~/workarea/llvm-r262808/src/test/CodeGen/AMDGPU/vgpr-spill-emergency-stack-slot.ll
-debug-only=regalloc >/dev/null 2>out1

Delete the trace from out1 before the section of "REGISTER MAP", then
execute the command below:
for ((i=0; i<256; i++)); do

grep "VGPR$i[^0-9]" out1 &>/dev/null
if [[ "$?" != "0" ]]; then
  echo VGPR$i
fi

done

The output is:
VGPR40
VGPR189
VGPR190

So even if the compiler says GCN: NumVgprs is 256, there are three
VGPRs never used.

Thanks,
Wei.

Hi Wei,

I believe we are almost done. Thanks for your work and patience on this.

There are mainly three items to address:

  • There are typos widely spread in the file; mergable -> mergeable, redundent -> redundant
  • Do not repeat method names on comment.
  • Fix on the test cases. See the inline comment.

As for the benchmarking, almost all the diffs came back as improvement of up to 7%! This is impressive.
The regressions seem like side effect, i.e., we generate less load pair in a few case, because the related spill slots are not next to each other anymore. This was luck previously.

Anyhow, looking forward for the final fix-ups.

Cheers,
-Quentin

include/llvm/CodeGen/LiveRangeEdit.h
78

Switch to SmallPtrSetImpl, this the size of the type is not relevant.

150

Don’t repeat the method name in the comment.

lib/CodeGen/InlineSpiller.cpp
76

Mergeable.
Do a search, the typo is widely spread :).

332–333

Don’t repeat the method name.

974

Don’t repeat the name of the method.

985

More mergeable typos...

lib/CodeGen/LiveRangeEdit.cpp
383

Some update problem I believe.

lib/CodeGen/RegAllocBase.cpp
159

Capitale letter for the first letter of the variable name.

lib/CodeGen/RegAllocBase.h
88

Add virtual keyword.
Subclasses may want to do additional things.

lib/CodeGen/RegAllocPBQP.cpp
728

Variables start with a capital letter.

lib/CodeGen/Spiller.h
31–32

Other spillers out-of-tree may exist and there is little interest in having them to implement a post optimization method if they do not need it.
In other words, instead of a pure virtual method, do nothing for the default implementation.

38–46

I was thinking in case we want to test without the post-optimization.
But I am fine if it is always enabled.

lib/CodeGen/SplitKit.h
335

Typo: redundant

test/CodeGen/X86/hoist-spill.ll
2

Make this a file check test.

116

Get rid of the attributes if they are not actually needed.

test/CodeGen/X86/new-remat.ll
13

Use opt -instnamer to get rid of the %[0-9]+ variables.

In D15302#379497, @wmi wrote:

I noticed that even without my change, although compiler output "GCN:
NumVgprs is 256", when I looked at the trace of -debug-only=regalloc,
I found there were some VGPR unused.

Here is what I did:
~/workarea/llvm-r262808/dbuild/./bin/llc -march=amdgcn -mcpu=tahiti
-mattr=+vgpr-spilling -verify-machineinstrs <
~/workarea/llvm-r262808/src/test/CodeGen/AMDGPU/vgpr-spill-emergency-stack-slot.ll
-debug-only=regalloc >/dev/null 2>out1

Delete the trace from out1 before the section of "REGISTER MAP", then
execute the command below:
for ((i=0; i<256; i++)); do

grep "VGPR$i[^0-9]" out1 &>/dev/null
if [[ "$?" != "0" ]]; then
  echo VGPR$i
fi

done

The output is:
VGPR40
VGPR189
VGPR190

So even if the compiler says GCN: NumVgprs is 256, there are three
VGPRs never used.

NumVgprs is the number of VGPRs that need to be allocated for the program, so the fact that there are gaps doesn't matter (though this is strange). If you use only register v255, you still need to allocate all 256 registers.

wmi updated this revision to Diff 51245.Mar 21 2016, 4:52 PM
wmi edited edge metadata.

Fix all the comments Quentin suggested. Thanks for the careful review.

lib/CodeGen/InlineSpiller.cpp
76

Fixed.

332–333

Fixed.

974

All similar comments fixed.

985

All such typos Fixed.

lib/CodeGen/LiveRangeEdit.cpp
383

Fixed.

lib/CodeGen/RegAllocBase.cpp
159

Fixed.

lib/CodeGen/RegAllocBase.h
88

Fixed.

lib/CodeGen/RegAllocPBQP.cpp
728

Fixed.

lib/CodeGen/Spiller.h
31–32

Make sense. Fixed.

38–46

Ok, I leave it there for now.

lib/CodeGen/SplitKit.h
335

Fixed here and many other places.

test/CodeGen/X86/hoist-spill.ll
2

I felt the file check test was not as general as the above test, but filecheck can still work, so I switch to file check here.

116

Fixed.

test/CodeGen/X86/new-remat.ll
13

Fixed.

Hi Wei,

I think we will need to wait for Tom to double check what happened for AMDGPU.

One question though, this revision ended up being the combination of the 3 parts, right?

Cheers,
-Quentin

test/CodeGen/X86/hoist-spill.ll
3

You could check where the spills actually are. But it already looks pretty good now :).

wmi added a comment.Mar 29 2016, 9:47 AM

So even if the compiler says GCN: NumVgprs is 256, there are three

VGPRs never used.

NumVgprs is the number of VGPRs that need to be allocated for the program, so the fact that there are gaps doesn't matter (though this is strange). If you use only register v255, you still need to allocate all 256 registers.

Hi Tom,

I found with my patch here, the Spill num for the testcase increases
from 68 to 152, and Reload num increases from 72 to 188. I havn't
throughly understood what is wrong here, but I can roughly describe
how the problem happen and say it may be a problem of local splitting,
instead of my patch.

In the testcase, there are roughly 64 VReg_128 vars overlapping with
each other consuming all the 256 VGPRs and some other scattered VGPR
uses. Each VReg_128 var occupies 4 consecutive VGPRs, so VGPR
registers are allocated in this way: vreg1: VGPR0_VGPR1_VGPR2_VGPR3;
vreg2: VGPR4_VGPR5_VGPR6_VGPR7; ......

Because we have some other scattered VGPR uses, we cannot allocate all
the 64 VReg_128 vars in register, so splitting is needed. region
splitting will not bring trouble because it only tries to fill holes,
i.e., vregs after the splitting usually will not evict other vregs.
local splitting can bring a lot of mess to the allocation here.
Suppose it tries to find a local gap inside BB to split vreg3
(VReg_128 type). After the local split is done, vreg3 will be splitted
into vreg3-1 and vreg3-2. vreg3-1 and vreg3-2 have short live ranges
so both of them have relatively larger weight. vreg3-1 may find a hole
and is allocated to VGPR2_VGPR3_VGPR4_VGPR5, then vreg3-2 will get a
hint of VGPR2_VGPR3_VGPR4_VGPR5 and will evict vreg1
(VGPR0_VGPR1_VGPR2_VGPR3) and vreg2 (VGPR4_VGPR5_VGPR6_VGPR7) above.
To find consecutive VGPRs for vreg1 and vreg2, reg alloc will do more
region splitting/local splitting and more evictions, and causes more
and more vregs hard to find consecutive VGPRs.

With my patch, it will add one more VReg_128 interval during splitting
because of hoisting (This is a separate problem I described in a TODO
about improving hoistCopies in previous reply). To allocate the
VReg_128 var, it triggers more region splitting and local splitting,
and makes more vars spilled.

To show the problem, I experimentally turn off local splitting for
trunk without my patch, the Spill num for the testcase drops from 68
to 56, and Reload num drops from 72 to 36. When turn off local
splitting for trunk with my patch, the Spill num for the testcase
drops from 152 to 24, and Reload num drops from 188 to 24.

So this is probably a separate issue for architecture using
consecutive combined registers for large data type.

Thanks,
Wei.

Hi Tom,

Do you think the issue is a blocker for this patch or a separated one?
Want to get your confirmation so I can decide how to push the work
forward.

As for using 254 VGPRs instead of 256 VGPRs, I think it just cannot
find 4 consecutive VGPRs for VReg_128 data. The holes in the end (v254
v255) have no difference with holes in the middle. Is it correct?

Thanks,
Wei.

I think your analysis is correct about why it doesn't use all 256 register. I actually hit this same thing in another patch I'm working on. I have to objections to this patch being pushed.

It turns out this was ready just in time: we just noticed that r263460 essentially undermines all of the work to avoid PR17409, and we now have widespread superlinear compile times with sanitizers (and possibly other code).

Just wanted to confirm with you Tom that this LGTM, and encourage Wei to go ahead and land it as soon as Tom acks. =D We have a *bunch* of stuff blocked on the compile time issues here.

One minor nit below.

lib/CodeGen/InlineSpiller.cpp
57

I get a warning saying this is unused when building with this patched in...

wmi added a comment.Apr 2 2016, 6:19 PM

Thanks for the support of this patch. Looks like Tom's "to objection"
is a typo of "no objection". I will prepare to commit the patch.

Wei.

wmi updated this revision to Diff 52569.Apr 4 2016, 9:46 AM

Fix my mistake introduced when I was addressing the review comments:

I accidentally remove the virtual keyword of postOptimization in lib/CodeGen/Spiller.h. It should not be a pure virtual function, but still should be virtual.

This will fix the warning Chandler saw.

This revision was automatically updated to reflect the committed changes.
wmi retitled this revision from [Greedy regalloc] Replace analyzeSiblingValues with something new [Part1] to [Greedy regalloc] Replace analyzeSiblingValues with something new.Apr 4 2016, 9:49 AM
wmi added a comment.Apr 9 2016, 11:03 AM

Hi Quentin,

Recently, I committed some bug fixes for the patch here without getting approvement first because I think they are relatively trivial. Please give them a postcommit review: http://reviews.llvm.org/D18934

There are another two fixes which are somewhat substantial, which I think needs to be reviewed before commit.
http://reviews.llvm.org/D18935
http://reviews.llvm.org/D18936

Thanks,
Wei.