This is an archive of the discontinued LLVM Phabricator instance.

[Greedy regalloc] Replace analyzeSiblingValues with something new
ClosedPublic

Authored by wmi on Dec 7 2015, 1:51 PM.

Details

Summary

The change is to solve PR17409 (https://llvm.org/bugs/show_bug.cgi?id=17409) and its several duplicates. The change is divided into three parts for easier review.

The major issue of analyzeSiblingValues is when a virtreg is splitted to N siblings with the same original VNI, and there are some PHIDefs generated in its splitting, the VNInfo of every sibling will be added to the Dependents of all other siblings, which creates a NxN network. traceSiblingValue is propagating SibValue info through this NxN network so it has NxN time complexity. In addition, selectOrSplit is called for all the N siblings sequentially. When reg pressure is high, a large percentage of siblings will be spilled (let's suppose N/2 siblings will be spilled), and traceSiblingValue will be called N/2 times indirectly from selectOrSplit, then there will be N^3 time complexity in total.

analyzeSiblingValues has two major usages: One is to figure out the SibValueInfo::SpillVNI of the virtReg to be spill so the spill can be hoisted to the place after SpillVNI->def and redundent spills are eliminated at the same time. The other is to trace the sibling copies back to the original value so the computation of the original value can be used for rematerialization. We replace analyzeSiblingValues by reimplementing these functionalities in Part1 and Part2.

Part1:
Instead of figure out the place to hoist spill for each virtReg to be spilled, we do that all at once when allocatePhysRegs is done. With all spills in place, we group spills with the same values (having the same OrigVNI). For each group of the spills with equal values, first we remove redundent spills dominated by other spill in the group, then traverse the dominate tree in post-order and hoist the spills to less frequent dominate tree node. Since spill can be hoisted to a cold dominate tree node without any sibling's VNI->def in the node, it can be better than the original implementation.

I didn't follow Jakub's proposal in pr17409 to change hoistCopiesForSize because redundent backcopies seen in one round of splitting is limited. Suppose Vreg1 is splitted to Vreg2, Vreg3 and Vreg4 in the first round of splitting, and Vreg4 is further splitted to Vreg5 and Vreg6 in the second round of splitting, the redundent backcopies between Vreg2=Vreg3 and Vreg5=Vreg6 cannot be found (I caught more than 100 such cases in llvm testsuites which left redundent spills in the final asm code)

Part2:
To find out the computation of the original value for rematerialization, we always query the inst at OrigVNI->def. To handle the case that the inst at OrigVNI->def has been removed during rematerialization, we change rematerialization to not delete the inst at OrigVNI->def even if it is already dead. In stead, we change the dest vreg to a new vreg (The new vreg will not be reg allocated so it will not affect the allocation of other vregs), save the inst to a set named as DeadRemats, and shrink the original dest vreg in the same way as previous. The insts in DeadRemats will be removed after allocatePhysRegs is done.

Part3:
The Part3 of the change is to clean up the code related with analyzeSiblingValues.

Test with all three parts combined on x86_64-linux-gnu:

  1. The compile time for 1.c in pr24618 dropped from 0.34s to 0.25s The compile time for interpreter_goto_table_impl.ii in pr24618 dropped from 176.80s to 66.86s. I cannot verify the patch using tests in pr17409 because most bugs related with asan/ubsan have been workaround on sanitizer side.
  1. llvm testsuites. Perf: mostly neutral except one perf regression I havn't addressed: SingleSource/Benchmarks/Misc/mandel. The reason has been understood. It is because we didn't do local spill hoist which the original implementation did. The usage of the local hoist is described in the comment in propagateSiblingValue (Starting with // This is an alternative def earlier in the same MBB.) The local hoist cannot be done after allocatePhysRegs so I will address it in a separate patch. CC time: MultiSource/Applications/sqlite3/sqlite3 has 1.5% improvement steadily.
  1. I havn't cleaned up the llvm unit tests because I expect there will be many changes to the patches during the review. I will clean them up later.

Diff Detail

Repository
rL LLVM

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
wmi updated this revision to Diff 42080.Dec 7 2015, 1:51 PM
wmi updated this revision to Diff 44027.Jan 5 2016, 11:01 AM
  1. Change std::set to SmallPtrSet and DenseSet.
  2. Improve comments.
qcolombet edited edge metadata.Jan 6 2016, 1:07 PM

Hi Wei,

Thanks for working on this.

I haven’t looked into it yet, but I wanted to let you know this is on my todolist.

Thanks for your patience,
-Quentin

qcolombet requested changes to this revision.Jan 8 2016, 3:00 PM
qcolombet edited edge metadata.

Hi Wei,

I had a quick look at the patch and although I believe it does the job done, I do not think this is the way to fix the problem.
My understanding is that you are trading an expensive value tracking mechanism with a fancy, but apparently less expensive, spill placement mechanism.

I think it is not the right approach because this creates a (bigger) gap between the cost model of the register allocator and the actual spill cost. Indeed, the cost model of the register allocator, w.r.t. spill cost, is basically reload before the uses, spill after the definitions.
In other word, it is better to keep the spiller simple but be smarter on the splitting of live-ranges so that the register allocator takes the right spilling/splitting decisions. That way, sharing spills/reloads will come naturally without to do anything in the spiller plus we may get better copies placement.

This was also, I believe, what Jakob had in mind when he described a solution in PR17409.

Concretely, what you should do:
0. Create a baseline for performance comparisons without any changes

  1. Add an option to disable the InlineSpiller::analyzeSiblingValues, say -disable-spill-analyze-sibvalue
  2. Benchmark with (-mllvm) -disable-spill-analyze-sibvalue (-mllvm) -split-spill-mode=size (we may want to use “speed" and improve that splitting mode instead of “size")
  3. Investigate the regressions and/or file PRs
  4. Fix the regressions

At this point, the new (or size) split mode should be better or equivalent to the baseline and we can just kill analyzeSiblingValue and make that new mode the default.

Hope that helps.

Cheers,
-Quentin

This revision now requires changes to proceed.Jan 8 2016, 3:00 PM
wmi added a comment.Jan 8 2016, 5:02 PM

Thank you for the review.

I think it is not the right approach because this creates a (bigger) gap between the cost model of the register allocator and the actual spill cost. Indeed, the cost model of the register allocator, w.r.t. spill cost, is basically reload before the uses, spill after the definitions.
In other word, it is better to keep the spiller simple but be smarter on the splitting of live-ranges so that the register allocator takes the right spilling/splitting decisions. That way, sharing spills/reloads will come naturally without to do anything in the spiller plus we may get better copies placement.

This was also, I believe, what Jakob had in mind when he described a solution in PR17409.

I have some different ideas here because of the following three things:

  1. I tried Jakob's solution when I just started to work on this problem, but quickly I realized a fundamental problem there compared with the existing way using InlineSpiller::analyzeSiblingValues. For Jakob's solution, it can only see the spills generated from siblings splitted from current VirtReg, .i.e, in current round of selectOrSplit. Let's say current VirtReg R1 is splitted into siblings: R2, R3, R4 (Suppose R2 is for the remainder interval). Suppose R4 is further splitted in the next round of selectOrSplit and a new remainder interval R5 is generated. If a spill for R2 dominate a spill for R5, Jakob's solution cannot remove the spill for R5 because they are generated in different rounds of selectOrSplit. InlineSpiller::analyzeSiblingValues doesn't have the issue and that is exactly why InlineSpiller::analyzeSiblingValues has to pay so much cost to track the siblings with equal values.

    To evaluate how serious the problem above is, I did some experiment using Jakob's solution. I wrote a sanity check to caught redundent spills left over in the final stage (To make sure no later phase will clean up the redundent spills) because of the issue above, and I caught such cases in about 100 files when building llvm testsuite. And when I had the solution in this patch ready, I also used the sanity check and have ensured all those redundent spills have been cleaned up.
  1. In Jakob's solution, the spills sharing work is done mostly in func SplitEditor::hoistCopiesForSize. hoistCopiesForSize is called inside SplitEditor::finish (last step of splitting) .i.e, it hoist spills and removes redundent spills after the splitting decision has been done for the current VirtReg. So for Jakob's solution, it has the same cost model issue as what you described here.
  1. For my solution, the major func is called after RegAllocBase::allocatePhysRegs. That is to say it keeps the reload/spills in the original places (reload before the uses, spill after the definitions) during RegAllocBase::allocatePhysRegs, and only try to share/hoist spills after most of the regalloc work is done. So I think my solution seems to be more close to your idea here. It is like a cleanup work after regalloc, which has simpler logic and clearer impact which will be easier for performance tuning. In comparison, Jakob mentioned in PR17409 that enabling either split spill mode is going to affect the live range splitting algorithm a lot, and somebody has to track down the regressions and fix them.

Concretely, what you should do:
0. Create a baseline for performance comparisons without any changes

  1. Add an option to disable the InlineSpiller::analyzeSiblingValues, say -disable-spill-analyze-sibvalue
  2. Benchmark with (-mllvm) -disable-spill-analyze-sibvalue (-mllvm) -split-spill-mode=size (we may want to use “speed" and improve that splitting mode instead of “size")
  3. Investigate the regressions and/or file PRs
  4. Fix the regressions

At this point, the new (or size) split mode should be better or equivalent to the baseline and we can just kill analyzeSiblingValue and make that new mode the default.

Actually when I was working on the patch, I followed your steps here to improve it gradually by comparing with existing implementation and fixing regressions.

Another thing is: No matter which solution is adopted in the end, Part2 is needed because of there is no way to get DefMI for rematerialization after removing InlineSpiller::analyzeSiblingValues.

Thanks,
Wei.

Hi Wei,

Couple of comments:

#1

  • Out of curiosity, do you have numbers of how many redundant spills we have for the current solution?
  • I am not saying the live-range splitting mechanism is perfect, but it fits nicely to the exiting framework, in particular w.r.t. the way we model the spill cost. Which leads me to #2.

#2

  • hoistCopiesForSize is for Copies AFAIR, i.e., split points not spills. That means that, IIRC, we have the save cost model, since we do not care about split insertion in the cost.

#3

  • I agree the spill hoisting thing is more like a post reg alloc phase. Therefore, I would rather have it a post regalloc pass instead of embedded in the spiller. That being said, I understand this is easier to directly work with the virtual registers.

To summarize, this is fine to have a the spill hoisting optimization where you put it. I believe though that making the splitting smarter would be the first logical step to mitigate the need for such optimization and I am still concerned that it may be bad for compile time.

If you’d like to pursue into that direction anyway, it is okay, just couple of inlined comments.

As for path part 2, it does not do anything at the moment does? (I.e., we clear the set before walking through it).

Thanks,
-Quentin

include/llvm/CodeGen/VirtRegMap.h
66 ↗(On Diff #44027)

Is there a way this could be computed from the split map?

lib/CodeGen/InlineSpiller.cpp
161 ↗(On Diff #44027)

Can set private, right?

162 ↗(On Diff #44027)

Ditto.

164 ↗(On Diff #44027)

Ditto.

168 ↗(On Diff #44027)

Ditto.

lib/CodeGen/Spiller.h
34 ↗(On Diff #44027)

Call this method postOptimization and make it a non-abstract method.
We do not want the spillers existing out of tree to have to add a default implementation whereas they do not need to do anything.

wmi added a comment.Jan 11 2016, 12:46 PM

#1

  • Out of curiosity, do you have numbers of how many redundant spills we have for the current solution?
  • I am not saying the live-range splitting mechanism is perfect, but it fits nicely to the exiting framework, in particular w.r.t. the way we model the spill cost. Which leads me to #2.

I did that experiment -- trying to catch redundent spills in current analyzeSiblingValues solution. I did catch some in llvm testsuite(~20, I don't remember exactly), but most of them are left there because of the HoistCondition checking in func propagateSiblingValue, .i.e, because of HoistCondition, current solution left some not very important but fully redundent spills in the code.

#2

  • hoistCopiesForSize is for Copies AFAIR, i.e., split points not spills. That means that, IIRC, we have the save cost model, since we do not care about split insertion in the cost.

Could you elaberate what the save cost model mean here?

#3

  • I agree the spill hoisting thing is more like a post reg alloc phase. Therefore, I would rather have it a post regalloc pass instead of embedded in the spiller. That being said, I understand this is easier to directly work with the virtual registers.

To summarize, this is fine to have a the spill hoisting optimization where you put it. I believe though that making the splitting smarter would be the first logical step to mitigate the need for such optimization and I am still concerned that it may be bad for compile time.

I compared the compile time between Jakob's solution (-disable-spill-analyze-sibvalue + -split-spill-mode=size) and my solution for some motivational testcases. They are the same. I will do more careful tests on this side, like using spec.

If you’d like to pursue into that direction anyway, it is okay, just couple of inlined comments.

As for path part 2, it does not do anything at the moment does? (I.e., we clear the set before walking through it).

The set DeadRemats is used to record def instructions which should have been removed when they are found to be dead after rematerialization. However, the def may still be useful for rematerialization of other siblings (Note without DefMI setting in analyzeSiblingValues, for all the siblings with equal values, the original_register_VNI->def is the best place to query the value expression. If the original def is deleted, we have no place to query value expression for rematerialization of siblings in the following rounds of selectOrSplit). So we decide to keep the dead instructions in their original places during the whole lifetime of allocatePhysRegs and use the set of DeadRemats to hold them (Changing the dest reg to a new dummy reg which will never be added to NewVRegs, so the live range can be updated properly in the same way as before). Those dead defs in DeadRemats are deleted after allocatePhysRegs is done.

wmi updated this revision to Diff 46011.Jan 26 2016, 10:20 AM
wmi edited edge metadata.

I noticed a weakness of my hoistSpill patch: When a redundent spill is deleted, the RHS register may become dead and its live range can be shrinked. However, hoistSpill is done after register assignment so it cannot utilize the benefit of live range shrinking caused by deleting redundent spills. I also caught some testcases producing non-optimal code because of it.

To solve it, the best way now I can think of is to combine -split-spill-mode=size (Jakob's solution) and the hoistSpill patch here. So common cases of redundent spills can be deleted by -split-spill-mode=size during register allocation and redundent spills generated from different splitting rounds will be cleaned up by hoistSpill patch here. This combined way generated the best code from my analysis of llvm testsuites.

About compile time, I used spec2006 C/C++ benchmarks to do the evaluation.
hoistSpill + -split-spill-mode=size compared with base: -0.70% compile time decrease on average.
hoistSpill + -split-spill-mode=size compared with -split-spill-mode=size only: +0.18% compile time increase on average.

performance is generally neutral for llvm-testsuite and google benchmarks. except SingleSource/Benchmarks/Misc/mandel's degradation caused by problem1 below.

I reevaluated performance for hoistSpill + -split-spill-mode=size for llvm testsuite and google benchmarks. Generally they are neutral compared with trunk, except SingleSource/Benchmarks/Misc/mandel. mandel degradation is caused by problem1 below.

Other changes:
Addressed Quentin's comments. Code reorganized -- add a HoistSpiller class. Fix some bugs when -regalloc=pbqp and -regalloc=basic are used. Fix unit tests and add new unit tests.

problems unaddressed:

  1. Spill hoist inside BB. propagateSiblingValue has a good description about its benefit in the comment, and I found testcases generating non-optimal code without it. I plan to address it separately.
  1. I deleted CodeGen/AArch64/aarch64-deferred-spilling.ll but I havn't got a good replacement for it. With the hoistSpill + -split-spill-mode=size patch, the pattern checked by the test doesn't appear anymore, and the test is relatively large so it is not easy to look closely whether it is just transformed to another appearance. I did see many cases that defer spills can get phyregs in the end and I got a few small testcases on x86. However, those are still somewhat fragile -- I found when I changed regalloc a little bit, the pattern disappeared.

Thanks,
Wei.

wmi marked 6 inline comments as done.Jan 26 2016, 10:26 AM
wmi added inline comments.
include/llvm/CodeGen/VirtRegMap.h
66 ↗(On Diff #44027)

Yes, I removed Virt2SiblingsMap and computed it from split map.

wmi updated this revision to Diff 46392.Jan 29 2016, 9:48 AM
wmi edited edge metadata.
wmi marked an inline comment as done.

Add SM_Speed split mode and use it as default. SM_Size sometimes will hoist spills from cold region in inner loop to hot region in outer loop, which is bad for performance. SM_Speed will only try to hoist spills from hot region to cold region. If it fails to hoist all the spills to a cold place, step back and remove spills dominated by others.

Compare "hoistSpill + split-spill-mode=speed" with "hoistSpill + split-spill-mode=size", an internal benchmark gets 1.5% improvement. llvm testsuite has no perf change.

Hi Mikhail,

Could you check how this patch impacts our performance?

Thanks,
-Quentin

wmi updated this revision to Diff 46399.Jan 29 2016, 10:48 AM

Hi Mikhail,
Could you check how this patch impacts our performance?

Thanks. This is the patch "hoistspill + split-mode-speed [Part1] + redo rematerialize [Part2] + remove analyzeSiblingValues [Part3]" merged together, which I used to do the performance testing.
// I divided the patches for easier review. To merge the separate patches together, it needs to resolve some conflicts.

Hi Wei,

Could you rebase your patches, they do not apply cleanly for me.

Also, while I am here, a couple of inlined comments :).

Cheers,
-Quentin

include/llvm/CodeGen/LiveRangeEdit.h
148 ↗(On Diff #46399)

This seems strange that the API allows to drop some of the new registers.
At the very least, we should document (i.e., put explanatory comments) why this is useful and why it is okay to drop such references.
In general, it should not.

184 ↗(On Diff #46399)

Instead of having at additional field which may be the same as ParentVNI in a lot of cases, this one could be computed.
Then, if this has some performance problem, we can think of a better caching mechanism.

235 ↗(On Diff #46399)

Replace 'it' by this live interval or something.
The context is now high in the source file and repeating it wouldn't hurt IMO.

lib/CodeGen/InlineSpiller.cpp
56 ↗(On Diff #46399)

Since this class does not inherit from Spiller, what about naming it HoistSpillHelper or something.

wmi added a comment.Mar 14 2016, 12:01 PM

I am rebasing the patch now.

Wei.

wmi updated this revision to Diff 50643.Mar 14 2016, 2:00 PM

Rebase the merged patch to r262808. Patch verified using llvm testsuite on x86-64 and qemu-aarch64.

qcolombet requested changes to this revision.Mar 15 2016, 5:37 PM
qcolombet edited edge metadata.

Hi Wei,

The benchmarks are still running, but so far, the numbers look good.

Anyhow, I finally made time for the review.

Generally speaking this looks almost good to me. The quadratic behavior of the first loop in runHoistAllSpills scares me and we need to look for a better alternative.
Moreover we need better comments for the APIs.
Finally, the tests change with more moves are worrisome. Could you explain why this happens and how we will fix that?
It seems to me we are choosing an insertion point for the store that happens too late.

I have highlighted all my concerns with the inline comments.

Cheers,
-Quentin

include/llvm/CodeGen/LiveRangeEdit.h
123 ↗(On Diff #50643)

Put that as at the end of the list with nullptr as default parameter.

221 ↗(On Diff #50643)

Maybe just say that DeadRemats is an optional field.
Mentioning Greedy here does not bring any value IMO.

lib/CodeGen/InlineSpiller.cpp
74 ↗(On Diff #50643)

mergeable

79 ↗(On Diff #50643)

Do not repeat the name of the field in the comment.

81 ↗(On Diff #50643)

[…] as the source (instead of RHS) of the new ..

82 ↗(On Diff #50643)

How big are the sets?
I would expect very few siblings on average and was wondering if a SmallSetVector or SmallSet would be more appropriate.

85 ↗(On Diff #50643)

Please use reference for values that cannot be nullptr. I.e., OrigVNI and BB.

90 ↗(On Diff #50643)

SpillsToKeep

213 ↗(On Diff #50643)

Hid the instantiation of the hoist spiller helper in the inliner spiller.
The positive side effect is that we won’t leak the memory!

223 ↗(On Diff #50643)

Hid the call to the hoist spiller helper in the inliner spiller.

998 ↗(On Diff #50643)

.i.e => i.e.

998 ↗(On Diff #50643)

Use doxygen style comment.

1010 ↗(On Diff #50643)

DenseSet does not guarantee that the iteration order is stable from one run to another, does it?

Although we should have several siblings live at the same time, this is theoretically possible.
In other words, we should use a container that has a deterministic iteration order for reproducibility. See the earlier suggestion I made.

1024 ↗(On Diff #50643)

This method would benefit a more verbose comment.
Maybe something along the line:
Starting from \p Root find a top-down traversal order of the dominator tree to visit all basic blocks containing the elements of \p Spills.
Redundent spills will be found and put into \p SpillsToRm at the same time.
\p SpillBBToSpill will populated as part of the process and maps a basic block to the first store occurring in the basic block.

\post SpillToRm.union(Spills@post) == Spills@pre

What is the usage of SpillsToKept? In particular the unsigned part?

Should we consider to moving some of the arguments to field of the current hoistspill instance?

1024 ↗(On Diff #50643)

This method does a bunch of things!
Although I understand we want to share the logic that does the traversal and such, I found that it makes the code harder to read.
I’d say as it is now and with more comment like I suggested, this is fine, but in general I rather have a better separation of concerns then try to optimize if it turns out to be problematic.
I am guessing you already went through that process, we are just lacking the history :).

1043 ↗(On Diff #50643)

Any chance this could be updated when we insert the spill?
Like I said, it just feels like getVisitOrders does too many things.

1051 ↗(On Diff #50643)

Please document what is WorkSet supposed to contain.

1053 ↗(On Diff #50643)

I think we should describe what is the expected root, because it seems strange to me that we don’t just take the node for the Root.

1065 ↗(On Diff #50643)

More comments, e.g.,
Node dominates Block and already store the value.
This store is redundant.

1078 ↗(On Diff #50643)

Ok, found the meaning of the unsigned elsewhere…
A comment here as well would be great.

1098 ↗(On Diff #50643)

Assert Orders.size == WorkSet.size?

1120 ↗(On Diff #50643)

We do not insert the original store, it is already there, right?

1135 ↗(On Diff #50643)

I believe we usually use bottom-up instead of bottom-top.

1137 ↗(On Diff #50643)

have

1174 ↗(On Diff #50643)

If the subtrees get big, we will end up recomputing this cost a bunch of time.
Could it be something we keep alongside the subtree?

1181 ↗(On Diff #50643)

We could add a mode for the hoist spiller, where code size is the priority. I.e., always hoist when SpillsInSubTree.size > 1
A follow-up patch is fine.

1208 ↗(On Diff #50643)

typo.

1210 ↗(On Diff #50643)

Variable must start with a capital letter.
Also why use ent for the name?

1218 ↗(On Diff #50643)

Explain the general algorithm here.

1234 ↗(On Diff #50643)

This loop scares me.
Any chance this information can be built as we insert spill.

1242 ↗(On Diff #50643)

empty

lib/CodeGen/RegAllocGreedy.cpp
401 ↗(On Diff #50643)

I would have put this into the base class.

2571 ↗(On Diff #50643)

spiller().postOptimization()

2597 ↗(On Diff #50643)

Should be created within the inline spiller. See my comment on createInlineSpiller.

lib/CodeGen/RegAllocPBQP.cpp
130 ↗(On Diff #50643)

This should be a separate patch.

156 ↗(On Diff #50643)

Ditto.

731 ↗(On Diff #50643)

We shouldn’t have to touch this.

lib/CodeGen/RegisterCoalescer.cpp
463 ↗(On Diff #50643)

We shouldn’t have to touch this.

lib/CodeGen/Spiller.h
38 ↗(On Diff #50643)

add a bool here that default to false for using a post optimization.

46 ↗(On Diff #50643)

Get rid of those.

lib/CodeGen/SplitKit.cpp
727 ↗(On Diff #50643)

Please don’t repeat the comment from the declaration.

742 ↗(On Diff #50643)

i.e.

749 ↗(On Diff #50643)

We should we start iterating with the next iterator instead of starting over.
The next call to count should early continue the loop but still!

1134 ↗(On Diff #50643)

Add a comment saying that hoistCopies will behave differently between size and speed, otherwise it feels like those modes are the same.

lib/CodeGen/SplitKit.h
333 ↗(On Diff #50643)

Don’t repeat the name of the method.

335 ↗(On Diff #50643)

Given how this is used, the actual name of this method should be computeRedundentBackCopies.

338 ↗(On Diff #50643)

Ditto.

test/CodeGen/X86/vselect-minmax.ll
4898 ↗(On Diff #50643)

Why is this happening?

7620 ↗(On Diff #50643)

Ditto.

This revision now requires changes to proceed.Mar 15 2016, 5:37 PM
wmi updated this revision to Diff 51175.Mar 21 2016, 9:45 AM
wmi edited edge metadata.

Quentin, I addressed most of your comments.

Major changes or changes needs to pay attention to:

  1. Added the patch that hoist Spill inside of BB to earlier place when the src of the spill is killed. It is done in InlineSpiller::hoistSpillInsideBB. With this part of change, some test changes are removed.
  1. I am not sure I made the exact change as you expect about where to put postOptimization.
  1. I found there was a comment in my previous patch saying DeadRemat is non-null only when regalloc is Greedy. It was wrong. All kinds of register allocator share the same InlineSpiller logic, so DeadRemat and original eliminateDeadRemats (Now it is put into RegAllocBase::postOptimization and RegAllocPBQP::postOptimization) are also necessary for PBQP and Basic.

I also noticed hoistCopies can be improved further. I plan to address it in the following patch.
With HoistSpillHelper, We still need hoistCopies when split-spill-mode=Speed because after removing some redundent spills, the sources of those spills may be shrinked. But when I addressed the review comments, I also found there is case that removing redundent spills not only cannot shrink the source of redundent spills, but also lengthen the live range of dst of those spills. Since we don't depend on hoistCopies to remove redundent spills (HoistSpillHelper can do that work better), we can change hoistCopies to remove spills only when it can shrink the source of redundent spills or at least not lengthen the live range of dst of spills. I can possibly do that by removing hoistCopies and extend hoistSpillInsideBB to handle cases across BBs -- to hoist spill only when its source is killed.

wmi added inline comments.Mar 21 2016, 9:46 AM
lib/CodeGen/SplitKit.cpp
742 ↗(On Diff #50643)

Fixed.

wmi added inline comments.Mar 21 2016, 9:46 AM
include/llvm/CodeGen/LiveRangeEdit.h
123 ↗(On Diff #50643)

Fixed.

148 ↗(On Diff #50643)

I Added comments to explain it.
In short, we don't want to allocate phys register for the dummy register used as temporary dst register of instruction in DeadRemats set.

184 ↗(On Diff #50643)

You are right. I don't have to save OrigVNI in struct Remat. Instead, add a parameter VNInfo *OrigVNI for LiveRangeEdit::canRematerializeAt.

221 ↗(On Diff #50643)

Fixed.

235–239 ↗(On Diff #50643)

Fixed.

lib/CodeGen/InlineSpiller.cpp
79 ↗(On Diff #50643)

Fixed

81 ↗(On Diff #50643)

Fixed

82 ↗(On Diff #50643)

Most of the cases the size of it is less than 16 I guess, so I use SmallSetVector instead. I cannot use SmallSet because the set needs to be iterated.

85 ↗(On Diff #50643)

Fixed.

90 ↗(On Diff #50643)

Fixed.

213 ↗(On Diff #50643)

Make HoistSpillHelper a field in inline spiller.

223 ↗(On Diff #50643)

I Added postOptimization as a pure virtual func in class Spiller, and put the code of hoist spiller helper inside of InlineSpiller::postOptimization.

998 ↗(On Diff #50643)

Fixed.

998 ↗(On Diff #50643)

Fixed.

1010 ↗(On Diff #50643)

Yes, when turn on split-mode-size, after hoistCopies, it is possible that several siblings live at the same time. it is not just therotically possible but realistic.

I use SmallSetVector instead here.

1024 ↗(On Diff #50643)

Thanks for your example comment. It is good. I copy most of them to the code.

Should we consider to moving some of the arguments to field of the current hoistspill instance?

hoistSpillHelper now has the same lifetime as InlineSpiller instance, so the lifes of those arguments are much shorter than that. That is why I choose to keep them as func local objects.

1024 ↗(On Diff #50643)

After separate part of the work into HoistSpillHelper::rmRedundentSpills and add more comments in the func body, it may make the code easier to read now.

1043 ↗(On Diff #50643)

I separate the first part of the work to another func: HoistSpillHelper::rmRedundentSpills, so HoistSpiller::getVisitOrders is more focus on what its name describes.

1051 ↗(On Diff #50643)

Done.

1053 ↗(On Diff #50643)

Done.

1065 ↗(On Diff #50643)

Done.

1078 ↗(On Diff #50643)

Done.

1098 ↗(On Diff #50643)

Done.

1120 ↗(On Diff #50643)

Yes, that is right.

1135 ↗(On Diff #50643)

Fixed.

1137 ↗(On Diff #50643)

Fixed.

1174 ↗(On Diff #50643)

I keep it along side the subtree in SpillsInSubTreeMap.

1181 ↗(On Diff #50643)

Ok, I will do it in a follow-up patch.

1208 ↗(On Diff #50643)

Fixed.

1210 ↗(On Diff #50643)

Fixed.

1218 ↗(On Diff #50643)

Done.

1234 ↗(On Diff #50643)

I simply remove the inner loop. SlotToOrigReg map will become somewhat bigger, but not a lot.

1242 ↗(On Diff #50643)

Fixed.

lib/CodeGen/RegAllocGreedy.cpp
401 ↗(On Diff #50643)

Done.

2571 ↗(On Diff #50643)

Done.

2597 ↗(On Diff #50643)

Done.

lib/CodeGen/RegAllocPBQP.cpp
130 ↗(On Diff #50643)

InlineSpiller is shared by all kinds of register allocator, so the DeadRemats logic is also needed by PBQP. If I separate the part out, I need to fix the related unit test.

731 ↗(On Diff #50643)

My original comment that DeadRemats is non-empty only when the regalloc is Greedy is wrong. Actually, InlineSpiller and related Remat logic are shared by all register allocators.

And RegAllocPBQP is not a subclass of RegAllocBase, so the code is needed.

lib/CodeGen/RegisterCoalescer.cpp
463 ↗(On Diff #50643)

Fixed.

lib/CodeGen/Spiller.h
38 ↗(On Diff #50643)

I don't get the intention to add a bool here. Is it used to guard post optimization? why it is needed?

46 ↗(On Diff #50643)

Done.

lib/CodeGen/SplitKit.cpp
727 ↗(On Diff #50643)

Comment removed.

749 ↗(On Diff #50643)

Fixed.

1134 ↗(On Diff #50643)

Added.

lib/CodeGen/SplitKit.h
333 ↗(On Diff #50643)

Fixed.

335 ↗(On Diff #50643)

Fixed.

338 ↗(On Diff #50643)

Fixed.

test/CodeGen/AMDGPU/vgpr-spill-emergency-stack-slot.ll
24 ↗(On Diff #51175)

This is change confusing to me, because if we only use 254 VGPRs then there shouldn't be any spills, but there are still spill instructions being emitted. It seems like this is probably a bug, but I will need to look at it more closely to see if it is an AMDGPU bug or a generic regalloc bug.

wmi added a comment.Mar 21 2016, 11:06 AM

I noticed that even without my change, although compiler output "GCN:
NumVgprs is 256", when I looked at the trace of -debug-only=regalloc,
I found there were some VGPR unused.

Here is what I did:
~/workarea/llvm-r262808/dbuild/./bin/llc -march=amdgcn -mcpu=tahiti
-mattr=+vgpr-spilling -verify-machineinstrs <
~/workarea/llvm-r262808/src/test/CodeGen/AMDGPU/vgpr-spill-emergency-stack-slot.ll
-debug-only=regalloc >/dev/null 2>out1

Delete the trace from out1 before the section of "REGISTER MAP", then
execute the command below:
for ((i=0; i<256; i++)); do

grep "VGPR$i[^0-9]" out1 &>/dev/null
if [[ "$?" != "0" ]]; then
  echo VGPR$i
fi

done

The output is:
VGPR40
VGPR189
VGPR190

So even if the compiler says GCN: NumVgprs is 256, there are three
VGPRs never used.

Thanks,
Wei.

Hi Wei,

I believe we are almost done. Thanks for your work and patience on this.

There are mainly three items to address:

  • There are typos widely spread in the file; mergable -> mergeable, redundent -> redundant
  • Do not repeat method names on comment.
  • Fix on the test cases. See the inline comment.

As for the benchmarking, almost all the diffs came back as improvement of up to 7%! This is impressive.
The regressions seem like side effect, i.e., we generate less load pair in a few case, because the related spill slots are not next to each other anymore. This was luck previously.

Anyhow, looking forward for the final fix-ups.

Cheers,
-Quentin

include/llvm/CodeGen/LiveRangeEdit.h
77 ↗(On Diff #51175)

Switch to SmallPtrSetImpl, this the size of the type is not relevant.

152 ↗(On Diff #51175)

Don’t repeat the method name in the comment.

lib/CodeGen/InlineSpiller.cpp
76 ↗(On Diff #51175)

Mergeable.
Do a search, the typo is widely spread :).

326 ↗(On Diff #51175)

Don’t repeat the method name.

1040 ↗(On Diff #51175)

Don’t repeat the name of the method.

1051 ↗(On Diff #51175)

More mergeable typos...

lib/CodeGen/LiveRangeEdit.cpp
382 ↗(On Diff #51175)

Some update problem I believe.

lib/CodeGen/RegAllocBase.cpp
159 ↗(On Diff #51175)

Capitale letter for the first letter of the variable name.

lib/CodeGen/RegAllocBase.h
88 ↗(On Diff #51175)

Add virtual keyword.
Subclasses may want to do additional things.

lib/CodeGen/RegAllocPBQP.cpp
727 ↗(On Diff #51175)

Variables start with a capital letter.

lib/CodeGen/Spiller.h
32 ↗(On Diff #51175)

Other spillers out-of-tree may exist and there is little interest in having them to implement a post optimization method if they do not need it.
In other words, instead of a pure virtual method, do nothing for the default implementation.

39 ↗(On Diff #51175)

I was thinking in case we want to test without the post-optimization.
But I am fine if it is always enabled.

lib/CodeGen/SplitKit.h
335 ↗(On Diff #51175)

Typo: redundant

test/CodeGen/X86/hoist-spill.ll
1 ↗(On Diff #51175)

Make this a file check test.

115 ↗(On Diff #51175)

Get rid of the attributes if they are not actually needed.

test/CodeGen/X86/new-remat.ll
12 ↗(On Diff #51175)

Use opt -instnamer to get rid of the %[0-9]+ variables.

In D15302#379497, @wmi wrote:

I noticed that even without my change, although compiler output "GCN:
NumVgprs is 256", when I looked at the trace of -debug-only=regalloc,
I found there were some VGPR unused.

Here is what I did:
~/workarea/llvm-r262808/dbuild/./bin/llc -march=amdgcn -mcpu=tahiti
-mattr=+vgpr-spilling -verify-machineinstrs <
~/workarea/llvm-r262808/src/test/CodeGen/AMDGPU/vgpr-spill-emergency-stack-slot.ll
-debug-only=regalloc >/dev/null 2>out1

Delete the trace from out1 before the section of "REGISTER MAP", then
execute the command below:
for ((i=0; i<256; i++)); do

grep "VGPR$i[^0-9]" out1 &>/dev/null
if [[ "$?" != "0" ]]; then
  echo VGPR$i
fi

done

The output is:
VGPR40
VGPR189
VGPR190

So even if the compiler says GCN: NumVgprs is 256, there are three
VGPRs never used.

NumVgprs is the number of VGPRs that need to be allocated for the program, so the fact that there are gaps doesn't matter (though this is strange). If you use only register v255, you still need to allocate all 256 registers.

wmi updated this revision to Diff 51245.Mar 21 2016, 4:52 PM
wmi edited edge metadata.

Fix all the comments Quentin suggested. Thanks for the careful review.

lib/CodeGen/InlineSpiller.cpp
76 ↗(On Diff #51175)

Fixed.

326 ↗(On Diff #51175)

Fixed.

1040 ↗(On Diff #51175)

All similar comments fixed.

1051 ↗(On Diff #51175)

All such typos Fixed.

lib/CodeGen/LiveRangeEdit.cpp
382 ↗(On Diff #51175)

Fixed.

lib/CodeGen/RegAllocBase.cpp
159 ↗(On Diff #51175)

Fixed.

lib/CodeGen/RegAllocBase.h
88 ↗(On Diff #51175)

Fixed.

lib/CodeGen/RegAllocPBQP.cpp
727 ↗(On Diff #51175)

Fixed.

lib/CodeGen/Spiller.h
32 ↗(On Diff #51175)

Make sense. Fixed.

39 ↗(On Diff #51175)

Ok, I leave it there for now.

lib/CodeGen/SplitKit.h
335 ↗(On Diff #51175)

Fixed here and many other places.

test/CodeGen/X86/hoist-spill.ll
1 ↗(On Diff #51175)

I felt the file check test was not as general as the above test, but filecheck can still work, so I switch to file check here.

115 ↗(On Diff #51175)

Fixed.

test/CodeGen/X86/new-remat.ll
12 ↗(On Diff #51175)

Fixed.

Hi Wei,

I think we will need to wait for Tom to double check what happened for AMDGPU.

One question though, this revision ended up being the combination of the 3 parts, right?

Cheers,
-Quentin

test/CodeGen/X86/hoist-spill.ll
2 ↗(On Diff #51245)

You could check where the spills actually are. But it already looks pretty good now :).

wmi added a comment.Mar 29 2016, 9:47 AM

So even if the compiler says GCN: NumVgprs is 256, there are three
VGPRs never used.

NumVgprs is the number of VGPRs that need to be allocated for the program, so the fact that there are gaps doesn't matter (though this is strange). If you use only register v255, you still need to allocate all 256 registers.

Hi Tom,

I found with my patch here, the Spill num for the testcase increases
from 68 to 152, and Reload num increases from 72 to 188. I havn't
throughly understood what is wrong here, but I can roughly describe
how the problem happen and say it may be a problem of local splitting,
instead of my patch.

In the testcase, there are roughly 64 VReg_128 vars overlapping with
each other consuming all the 256 VGPRs and some other scattered VGPR
uses. Each VReg_128 var occupies 4 consecutive VGPRs, so VGPR
registers are allocated in this way: vreg1: VGPR0_VGPR1_VGPR2_VGPR3;
vreg2: VGPR4_VGPR5_VGPR6_VGPR7; ......

Because we have some other scattered VGPR uses, we cannot allocate all
the 64 VReg_128 vars in register, so splitting is needed. region
splitting will not bring trouble because it only tries to fill holes,
i.e., vregs after the splitting usually will not evict other vregs.
local splitting can bring a lot of mess to the allocation here.
Suppose it tries to find a local gap inside BB to split vreg3
(VReg_128 type). After the local split is done, vreg3 will be splitted
into vreg3-1 and vreg3-2. vreg3-1 and vreg3-2 have short live ranges
so both of them have relatively larger weight. vreg3-1 may find a hole
and is allocated to VGPR2_VGPR3_VGPR4_VGPR5, then vreg3-2 will get a
hint of VGPR2_VGPR3_VGPR4_VGPR5 and will evict vreg1
(VGPR0_VGPR1_VGPR2_VGPR3) and vreg2 (VGPR4_VGPR5_VGPR6_VGPR7) above.
To find consecutive VGPRs for vreg1 and vreg2, reg alloc will do more
region splitting/local splitting and more evictions, and causes more
and more vregs hard to find consecutive VGPRs.

With my patch, it will add one more VReg_128 interval during splitting
because of hoisting (This is a separate problem I described in a TODO
about improving hoistCopies in previous reply). To allocate the
VReg_128 var, it triggers more region splitting and local splitting,
and makes more vars spilled.

To show the problem, I experimentally turn off local splitting for
trunk without my patch, the Spill num for the testcase drops from 68
to 56, and Reload num drops from 72 to 36. When turn off local
splitting for trunk with my patch, the Spill num for the testcase
drops from 152 to 24, and Reload num drops from 188 to 24.

So this is probably a separate issue for architecture using
consecutive combined registers for large data type.

Thanks,
Wei.

Hi Tom,

Do you think the issue is a blocker for this patch or a separated one?
Want to get your confirmation so I can decide how to push the work
forward.

As for using 254 VGPRs instead of 256 VGPRs, I think it just cannot
find 4 consecutive VGPRs for VReg_128 data. The holes in the end (v254
v255) have no difference with holes in the middle. Is it correct?

Thanks,
Wei.

I think your analysis is correct about why it doesn't use all 256 register. I actually hit this same thing in another patch I'm working on. I have to objections to this patch being pushed.

It turns out this was ready just in time: we just noticed that r263460 essentially undermines all of the work to avoid PR17409, and we now have widespread superlinear compile times with sanitizers (and possibly other code).

Just wanted to confirm with you Tom that this LGTM, and encourage Wei to go ahead and land it as soon as Tom acks. =D We have a *bunch* of stuff blocked on the compile time issues here.

One minor nit below.

lib/CodeGen/InlineSpiller.cpp
57 ↗(On Diff #51245)

I get a warning saying this is unused when building with this patched in...

wmi added a comment.Apr 2 2016, 6:19 PM

Thanks for the support of this patch. Looks like Tom's "to objection"
is a typo of "no objection". I will prepare to commit the patch.

Wei.

wmi updated this revision to Diff 52569.Apr 4 2016, 9:46 AM

Fix my mistake introduced when I was addressing the review comments:

I accidentally remove the virtual keyword of postOptimization in lib/CodeGen/Spiller.h. It should not be a pure virtual function, but still should be virtual.

This will fix the warning Chandler saw.

This revision was automatically updated to reflect the committed changes.
wmi retitled this revision from [Greedy regalloc] Replace analyzeSiblingValues with something new [Part1] to [Greedy regalloc] Replace analyzeSiblingValues with something new.Apr 4 2016, 9:49 AM
wmi added a comment.Apr 9 2016, 11:03 AM

Hi Quentin,

Recently, I committed some bug fixes for the patch here without getting approvement first because I think they are relatively trivial. Please give them a postcommit review: http://reviews.llvm.org/D18934

There are another two fixes which are somewhat substantial, which I think needs to be reviewed before commit.
http://reviews.llvm.org/D18935
http://reviews.llvm.org/D18936

Thanks,
Wei.

llvm/trunk/lib/CodeGen/SplitKit.cpp