This is an archive of the discontinued LLVM Phabricator instance.

[RegAlloc] Skip global splitting if the live range is huge and its spill is trivially rematerializable
ClosedPublic

Authored by wmi on Jul 15 2018, 1:51 PM.

Details

Summary

We ran into a case where MachineLICM hoists a large number of live ranges out of a big loop because it considers those live ranges trivially rematerializable. In regalloc, global splitting is tried first for those live ranges before they are spilled and rematerialized. Because the global splitting algorithm is quadratic, the large number of global splitting candidates caused a huge compile time increase (50s to 1400s on my local machine).

However, for live ranges which are very large and trivially rematerializable, we think it is better to skip global splitting entirely: this saves compile time with little chance of sacrificing performance. We use the number of segments in the live range as an indirect measure of whether global splitting would be expensive, and expose an option as a knob to adjust the size threshold.
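In rough terms, the check is an early exit in front of region (global) splitting. A minimal sketch of that shape is below, assuming an illustrative option name and default, and taking the rematerializability decision as an input parameter; it is not necessarily the exact committed code:

#include "llvm/CodeGen/LiveInterval.h"
#include "llvm/Support/CommandLine.h"
using namespace llvm;

// Illustrative knob; the real option name and default may differ.
static cl::opt<unsigned> HugeSizeForSplit(
    "huge-size-for-split", cl::Hidden, cl::init(5000),
    cl::desc("Skip global splitting when a trivially rematerializable "
             "live range has more segments than this threshold"));

// Shape of the early exit before region splitting is attempted.
// IsTriviallyRemat stands in for the spiller/LiveRangeEdit query the
// allocator already performs; it is a parameter here only to keep the
// sketch self-contained.
static bool shouldSkipGlobalSplit(const LiveInterval &VirtReg,
                                  bool IsTriviallyRemat) {
  // LiveInterval::size() is the number of segments, a cheap proxy for how
  // much work the quadratic region-splitting algorithm would have to do.
  return IsTriviallyRemat && VirtReg.size() > HugeSizeForSplit;
}

Making the threshold a cl::opt keeps it easy to retune if whatever default is chosen turns out to be too aggressive or too lax.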

Performance evaluation is ongoing.

Diff Detail

Repository
rL LLVM

Event Timeline

wmi created this revision.Jul 15 2018, 1:51 PM

Given that this only kicks in for an edge case, I think this is fine to go into the tree. It solves a very real problem for us, and we can iterate on it upstream if there are subsequent comments.

For context for others: this fixes an issue we've hit with flex-generated code combined with -fno-jump-tables. Sadly, we're using that option a lot more now because retpolines need it, which means we end up hitting some weird edge cases like this.

chandlerc accepted this revision.Jul 15 2018, 4:37 PM

Meant to actually say "LGTM" and feel free to submit. Minor tweaks seem fine to happen post-commit.

This revision is now accepted and ready to land.Jul 15 2018, 4:37 PM
This revision was automatically updated to reflect the committed changes.

For the record: the check is based on LiveInterval::size(), which gives you the number of segments. So I assume what is "huge" here is the number of basic blocks?

llvm/trunk/lib/CodeGen/RegAllocGreedy.cpp
130

This help string is wrong.

wmi added a comment.Jul 16 2018, 11:48 AM

For the record: the check is based on LiveInterval::size(), which gives you the number of segments. So I assume what is "huge" here is the number of basic blocks?

One segment can span multiple basic blocks. I am not sure whether one basic block can theoretically contain multiple segments, but it is uncommon. So empirically a large number of segments means a large number of basic blocks, and therefore a large number of edge bundle nodes and a high cost for the Hopfield neural network algorithm.
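To make that relationship concrete, here is a rough sketch of counting the blocks a live interval touches; the helper is illustrative only (it is not an existing LLVM function, and not how the allocator estimates cost):

#include "llvm/CodeGen/LiveInterval.h"
#include "llvm/CodeGen/SlotIndexes.h"
using namespace llvm;

// Illustrative helper: roughly count the basic blocks a live interval
// covers by walking each segment block by block. Segments that share a
// block are counted twice, so this is an upper bound, not an exact count.
static unsigned approxLiveBlocks(const LiveInterval &LI,
                                 const SlotIndexes &Indexes) {
  unsigned Blocks = 0;
  for (const LiveRange::Segment &S : LI) {
    // Each segment is a half-open [start, end) range of slot indexes and
    // may stretch across many basic blocks.
    SlotIndex Pos = S.start;
    while (Pos < S.end) {
      ++Blocks;
      Pos = Indexes.getMBBEndIdx(Indexes.getMBBFromIndex(Pos));
    }
  }
  return Blocks;
}

The patch avoids any such per-block walk and simply compares the segment count against the threshold, which is why it is cheap.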

In D49353#1163951, @wmi wrote:

For the record: the check is based on LiveInterval::size(), which gives you the number of segments. So I assume what is "huge" here is the number of basic blocks?

One segment can span multiple basic blocks. I am not sure whether one basic block can theoretically contain multiple segments, but it is uncommon. So empirically a large number of segments means a large number of basic blocks, and therefore a large number of edge bundle nodes and a high cost for the Hopfield neural network algorithm.

Yes, that's what I was getting at (just trying to understand the context); the description speaks about a live range being "huge", but just because you have a huge function or a huge basic block does not necessarily trigger this condition. I think right now, with the connected-component rule in place, we can have no more than two segments of the same virtual register inside a basic block (a value live in, and a value created there that is possibly live out); any other situation must have been split up into multiple vregs previously. I had the same experience that the bad situations for register allocation compile time (or copy coalescing, for that matter) are automatically generated lexer or parser code with a big number of basic blocks. That usually leads to new value numbers being created at join points, increasing the number of live range segments...

Anyway so far this change looks fine to me, but I'm still waiting for our internal systems to come back with numbers on this change (will probably be 1 or 2 more days until everything has cycled through).

wmi added a comment.Jul 16 2018, 3:25 PM
In D49353#1163951, @wmi wrote:

For the record: the check is based on LiveInterval::size(), which gives you the number of segments. So I assume what is "huge" here is the number of basic blocks?

One segment can span multiple basic blocks. I am not sure whether one basic block can theoretically contain multiple segments, but it is uncommon. So empirically a large number of segments means a large number of basic blocks, and therefore a large number of edge bundle nodes and a high cost for the Hopfield neural network algorithm.

Yes, that's what I was getting at (just trying to understand the context); the description speaks about a live range being "huge", but just because you have a huge function or a huge basic block does not necessarily trigger this condition. I think right now, with the connected-component rule in place, we can have no more than two segments of the same virtual register inside a basic block (a value live in, and a value created there that is possibly live out); any other situation must have been split up into multiple vregs previously. I had the same experience that the bad situations for register allocation compile time (or copy coalescing, for that matter) are automatically generated lexer or parser code with a big number of basic blocks. That usually leads to new value numbers being created at join points, increasing the number of live range segments...

Exactly as you are saying, copy coalescing is another problem. The patch here only solves the problem partially, and we are still facing the compile time problem of copy coalescing. A trivially rematerializable def instruction with thousands of uses is hoisted outside of a loop in MachineLICM. That instruction is then rematerialized thousands of times during copy coalescing, once for each use, and the live interval update for each rematerialization is very costly because of the large live interval.

I wonder how much extra benefit we get from MachineLICM by hoisting so many trivially rematerializable instructions outside of a big loop. I believe most loop-invariant loads, stores, and computations should already have been hoisted during the LICM phase, so the rematerializable instructions hoisted here shouldn't enable many other loop-invariant loads, stores, or computations. But the extra compile time cost may be significant.

Anyway so far this change looks fine to me, but I'm still waiting for our internal systems to come back with numbers on this change (will probably be 1 or 2 more days until everything has cycled through).