This is an archive of the discontinued LLVM Phabricator instance.

[RegAlloc Greedy]Account statepoints while splitting single basic block
AbandonedPublic

Authored by skatkov on Nov 10 2022, 8:39 PM.

Details

Summary

When Greedy Register Allocator takes a decision whether it makes
sense to split single block it consider multiple uses as must
split condition. However there are instructions which can
accept the operand on stack with the same efficiency as on register.
An example of such instruction is statepoint.

Split single basic block expects that we enter on stack and exit
on stack the basic block. If there are multiple uses it makes sense
to do unsplit before the first use and split after the last use.
If we decide not to split we will do only unsplit before each use
and no spills if there is no defs.

This is exactly a case to cover in this change.
We have only one use which is not a statepoint. So we will do
an unspill before this use and nothing for statepoints.

As a result we avoid to do a redundant spill and reduce the
register preasure in this basic block.

Diff Detail

Event Timeline

skatkov created this revision.Nov 10 2022, 8:39 PM
Herald added a project: Restricted Project. · View Herald TranscriptNov 10 2022, 8:39 PM
skatkov requested review of this revision.Nov 10 2022, 8:39 PM
Herald added a project: Restricted Project. · View Herald TranscriptNov 10 2022, 8:39 PM
Herald added a subscriber: wdng. · View Herald Transcript
arsenm added inline comments.Nov 10 2022, 8:42 PM
llvm/lib/CodeGen/SplitKit.cpp
1584

Early exit after the first one?

llvm/lib/CodeGen/StackMaps.cpp
151

Do you need to filter out implicit-defs? uses doesn't

skatkov edited the summary of this revision. (Show Details)Nov 10 2022, 8:42 PM
arsenm added inline comments.Nov 10 2022, 8:45 PM
llvm/lib/CodeGen/StackMaps.cpp
154

Do you need to account for subregisters?

skatkov added inline comments.Nov 10 2022, 8:46 PM
llvm/lib/CodeGen/SplitKit.cpp
1584

Sorry, do not catch. What do you mean?

llvm/lib/CodeGen/StackMaps.cpp
151

Good question. For explicit defs - they are tied to uses in foldable area, no need to check them.
I doubt statepoint instruction has virtual registers inplicit-defs.
Will check.

skatkov added inline comments.Nov 10 2022, 8:50 PM
llvm/lib/CodeGen/StackMaps.cpp
154

It is about virtual registers. I think no.

skatkov added inline comments.Nov 10 2022, 8:54 PM
llvm/lib/CodeGen/StackMaps.cpp
151

Statepoint instruction is pseudo instruction and does not have implicit-defs in description.
There is a "hack" for aarch64 platform which adds LR physical register as implicit def to statepoint instruction.

But virtual register cannot be implicit def of statepoint instruciton.
So the answer - I do not need to check implicit-defs.

arsenm added inline comments.Nov 10 2022, 8:54 PM
llvm/lib/CodeGen/StackMaps.cpp
154

Only virtual registers have subregisters

skatkov added inline comments.Nov 10 2022, 9:02 PM
llvm/lib/CodeGen/StackMaps.cpp
154

Well, I go to learn about statepoint and sub-registers. Still think I do not need but need to prove :)

skatkov added inline comments.Nov 13 2022, 7:44 PM
llvm/lib/CodeGen/StackMaps.cpp
154

Returned back.
If foldable area contains a use of a reg or its subreg - we will spill it (see InlineSpiller::foldMemoryOperand).
If non-foldable area contains a use of reg or its subreg, there are two cases:

  1. subreg is undef
  2. subreg has a valid def.

For the second case we definitely want to report false.
For the first case - it should be probably replaced earlier to undef, but for simplicity let's just report false as well.

So in my understanding I do not need to account for subregisters.

qcolombet added inline comments.Nov 14 2022, 11:34 AM
llvm/lib/CodeGen/SplitKit.cpp
1622

That's awfully specific for this part of the code.

Could this be handled when rewriting the spills instead with foldMemoryOperand?

skatkov added inline comments.Nov 14 2022, 9:41 PM
llvm/lib/CodeGen/SplitKit.cpp
1622

Hi Quentin, first of all thank a lot for a review and you comment (to Matt as well) but now I feel I need a help.
It looks like I misunderstand the reason why we split single basic block for single use at all.

The commit it was introduced has been done in 2011 with the following comment (and unfortunately without a test):
commit 8627ea91cb097f45894fa80fa04a3fb0100530db
Author: Jakob Stoklund Olesen <stoklund@2pi.dk>
Date: Fri Aug 5 22:20:45 2011 +0000

Split around single instructions to enable register class inflation.

Normally, we don't create a live range for a single instruction in a
basic block, the spiller does that anyway. However, when splitting a
live range that belongs to a proper register sub-class, inserting these
extra COPY instructions completely remove the constraints from the
remainder interval, and it may be allocated from the larger super-class.

The spiller will mop up these small live ranges if we end up spilling
anyway. It calls them snippets.

Split Single basic block is triggered by Greedy Register Allocator in two places:

  1. Region split heuristic decided that we enter the current basic block on stack and exit it as well (if (!IntvIn && !IntvOut)).
  2. Region split failed and we try to split each basic block where there are uses.

In both cases it is expected that enter basic block on stack and exit it on stack as well (if there are liveIn and liveOut respectively).
I understand why we try to split if there are several uses. But if we have on use, we will do unspill/spill operation anyway why not to rely on InlineSpiller to create such new live intervals?

The comment above said that in this case remainder interval (I guess original interval without this one split out) removes the constrains and can be allocated from larger super-class. But Region splitter already isolated this basic block and remainder interval is already decided to be allocated on some register. Why do we need extra split here?
I guess it is related to sub registers again and it makes me thinking that I misunderstand here something.

The trouble with your proposal to remove this line of the code and rely on InlineSpiller is as follows:
The bad scenario looks as follows: we keep the live range for single use which is statepoint.
There is no register pressure on this basic block at the end. As a result this interval will be allocated to some register.
As a result in basic block we will have a unspill, use in statepoint and probably spill after statepoint.
While we could avoid unspill/spill operation in this case at all due to statepoint is ok to accept the value on stack with the same efficiency as it was on register. But in this scenario InlineSpiller will not be triggered at all due to live interval will just be assigned to physical register.
May be I miss something again :(

Ok, I go to learn again about sub-registers/spills and so on because this code makes me thinking I miss something.
Any help would be very useful.

skatkov marked an inline comment as not done.Nov 14 2022, 10:45 PM
skatkov added inline comments.
llvm/lib/CodeGen/SplitKit.cpp
1622

BTW: I disabled split for single instruction and got two test failures.

LLVM :: CodeGen/AMDGPU/spill-vgpr.ll
LLVM :: CodeGen/X86/fshl.ll

For x86, to me, the code becomes better.

For AMDGPU - not sure :)

skatkov marked an inline comment as not done.Nov 15 2022, 12:35 AM
skatkov added inline comments.Nov 15 2022, 12:39 AM
llvm/lib/CodeGen/SplitKit.cpp
1622

ok, it looks like when I'm saying that while splitting single basic block we already decided that we entry on stack and exit on stack, I'm not fully correct. Indeed complementary interval will get RS_Split but it still can be assigned to physical register if

  1. There is an available physical register or
  2. It can evict someone.

It complicates my heuristic...

Hi Serguei,

In both cases it is expected that enter basic block on stack and exit it on stack as well (if there are liveIn and liveOut respectively).

As you found out, this is definitely not true :).

The reason for split, irrespective of where, is to relax the register constraints. You cannot know at this point whether or not the previous live-range will end-up in stack or not.

E.g.,

v1 = defA ...
...  = useB v1
... = call
... = useC v1

As it stands, v1 live-range needs to work with all these contraints:

  • What defA supports (think encoding constraints)
  • What useB supports
  • What useC supports
  • What call preserves

As you can imagine based on which constraints you need to consider you get an allocatable pool of registers of different sizes. Now, when you add some split points, you relax the constraints on the new live-ranges.
E.g., suppose we split v1 right after call. The new live-range will only be constrained by useC and v1 wouldn't have to comply to useC's constraints either. Maybe that's enough to avoid spill, maybe it's not. We don't know at this point. (Plus you need to consider what is available at each different program point based on what is already allocated.)

To step back a little bit, what are you trying to achieve?

Cheers,
-Quentin

BTW, what the commit you mentioned refers to is that these small split live-ranges are supposed to be removed from within the spiller (see InlineSpiller.cpp) when the live-range feeding it has been spilled. This is handled in the code as snippet. Search for snippet in that file to see why it is not doing what you want in your case.

I could be wrong but my guess is you're missing some TargetInstrInfo::foldMemoryOperand targeting. Though I'm not familiar with how statepoints work.

Hi Quentin, thank you for participation in this discussion and your comments.
Today, my plan to go to InlineSpiller and learn in details what I'm missing :)

About the issue I'm trying to solve. I wrote a dedicated (more or less small) test to show the problem (llvm/test/CodeGen/X86/statepoint-split-single-block.ll).

First of all some details about statepoint instruction (related to interaction with register allocator).
Statepoint instruction is semantically a call with some additional information (like deopt state and gc-lives). For this additional information (some elements are represented as virtual registers before register allocation), all we need is two know where the corresponding value is located. It would be perfectly ok if it is a register or stack location. This information will be encoded into stack map during machine instruction lowering.
So difference between statepoint instruction and other instructions that it has operands which do not require physical register as operand and more over they do not prefer register.
InlinerSpiller will successfully fold any load from stack for such operands and unspill will be eliminated.

Return back to my test and problem to solve.
There is incoming argument %arg which has actually three uses and two defs.

  1. def as incoming argument
  2. use in a copy to rdi as it an argument to callee
  3. use in a statepoint instruction as gc-live
  4. def in a statepoint as gc-live can be relocated
  5. use in return statement

2-3-4 are in the same block and 3-4 are on the same instruction.
To disable region spilt and force spill in the first block and unspill in the last block I've added calls to @nocsr which is actually a callee which does not preserve any physical registers.
As a result when we come to basic block with a call.
It observes three uses (2-3-4) and decides to make a split around them. As a result we got a new live interval with eliminated constrains caused by call to @nocsr.
Register allocator has a lot of free callee saved register and assign one of them to this live interval.
Technically we did a good job - made a split and was able to allocate live interval to register.
However complement interval goes through @nocsr constraints and has nothing to do except spilling.

So we enter basic block on stack, do unspill to chosen register rbx, copy rbx to rdi (use 2), use this rbx in 3-4 and finally spill rbx back due to we exit on stack.
However as I said statepoint does not require register. If we did not allocate physical register to our interval and just spill around uses then
we load rdi from stack and did nothing for statepoint.
That it was the test shows.

Now, what was in my mind behind this patch. I thought it is a win-win patch due to I thought about single basic block split as follows:
we enter on stack and exit on stack, we do not care about statepoint and now if there is only one use, do not create new interval and just go on stack.
It makes sense.

It looks like, now I should think about it in the following way:
statepoint instruction in its specific operands does not introduced any constraints (or what constraints?) and so no sense to split it in a separate interval.
That is what in my mind at the moment :)

Probably I should do it in completely other direction - when Inline spiller spill some interval, check its siblings and if they are perfectly good to be on stack, unassing them from register and spill as well...

I did an alternative solution https://reviews.llvm.org/D138093 to extend the snippet definition.
Actually I got almost the same result except test for invoke.
I guess it is due to for invoke the split interval is not in the same basic block when the solution is based on snippets.

Please take a look.

Friendly remainder: ping.