Page MenuHomePhabricator

[calcspillweights] mark LiveIntervals from INLINEASM_BR defs as not spillable
Needs ReviewPublic

Authored by nickdesaulniers on Apr 9 2020, 6:54 PM.

Details

Summary

Typically, most of the existing callbr tests are in the form:

  callbr ...
    to label %fallthrough [label %indirect_target]
...
fallthrough:
...
indirect_target:
...

Optimizations such as loop-sink and mergeicmps have been observed to
produce callbr's in the form:

  callbr ...
    to label %fallthrough [label %indirect_target]
...
indirect_target:
...
fallthrough:
...

Note the order of the BasicBlocks is reversed. This generally isn't a
problem, until we encounter register pressure in this non-canonical
form.

Under register pressure, it has been observed that the greedy
register allocator will spill the registers defined in INLINEASM_BR
MachineInstrs that whose LiveIntervals cross into the copy blocks
split during pre-RA ISel.

This is bad. Having a register spill (ie. a COPY) post a Terminal
MachineInstr is a violation of the MachineVerifiers invariant that
subsequent Terminal MachineInstrs in a MachineBasicBlock after the first
must also be Terminal.

Further, this leads to a cascaded failure where much later in
branch-folder calls to TargetInstructionInfo::analyzeBranch() fail to
properly classify a MachineBasicBlock. This triggers calls to
MachineBasicBlock::CorrectCFGEdges() which removes the indirect
successors of the MachineBasicBlock containing the INDIRECTASM_BR.
Finally, branch-folder removes the INDIRECTASM_BR indirect target
MachineBasicBlocks, as they appear to have no predecessors. This results
in references to labels that have been removed.

We should try harder to force these outputs to stay in registers and not
be spilled.

Some alternatives considered:

  • We could teach loop-sink and mergeicmps to not create "non-canonical" callbrs. I prefer to allow LLVM IR to be flexible; I don't think we should canonicalize an order to the BasicBlocks.
  • We could "canonicalize" the MachineBasicBlocks during pre-RA ISel, such that the MachineBasicBlock following the split block following a block terminated by an INLINEASM_BR was the fallthrough successor. So three MachineBasicBlocks that fallthrough in order. I prefer to allow arbitrary ordering of MachineBasicBlocks leading up to regalloc, which makes regalloc able to handle more generic cases.

It might seem bad to take away freedom from the register allocator, but
in this case and especially when in the canonical form, the
LiveIntervals are quite small, and we generally don't want a register
spill between the INLINEASM_BR and it's first COPY to GPR. Later passes
typically re-order the fallthrough case anyways.

A helpful command I used to observe regalloc in action for debugging
this:

llc -debug-only=calcspillweights -debug-only=spill-code-placement \
  -debug-only=greedy -debug-only=regalloc -regalloc=greedy \
  -stop-after=greedy -verify-regalloc -verify-machineinstrs \
  llvm/test/CodeGen/X86/callbr-asm-regalloc.ll

Diff Detail

Event Timeline

nickdesaulniers created this revision.Apr 9 2020, 6:54 PM

If I'm understanding correctly, the issue is that the INLINEASM_BR is producing a virtual register, we try to spill that register, and the spill is inserted in the same block as the INLINEASM_BR?

I can see two possible approaches here, at a high level:

  1. Forbid directly spilling registers produced by an INLINEASM_BR; instead, force a live interval split, and then we can spill the split interval.
  2. Allow spilling registers produced by an INLINEASM_BR, but teach the spill insertion code to insert the spill into the fallthrough succecssor, as opposed to the INLINEASM_BR block itself.

This patch is sort of along the lines of (1), but I'm not sure this is enough to make the interval split reliably (in which case, the allocator could potentially run out of registers). I haven't spent enough time with the spill weight code to say for sure.

If I'm understanding correctly, the issue is that the INLINEASM_BR is producing a virtual register, we try to spill that register, and the spill is inserted in the same block as the INLINEASM_BR?

I can see two possible approaches here, at a high level:

  1. Forbid directly spilling registers produced by an INLINEASM_BR; instead, force a live interval split, and then we can spill the split interval.
  2. Allow spilling registers produced by an INLINEASM_BR, but teach the spill insertion code to insert the spill into the fallthrough succecssor, as opposed to the INLINEASM_BR block itself.

Re: 2: it looks like InlineSpiller::insertSpill might be the most appropriate place to do that, as it just picks the next MachineInstr to insert the spill before, even in that case it violates the invariant that only Terminal MachineInstr end a MachineBasicBlock (no non-Terminal MachineInstrs after the first Terminal MachineInstr). We could check that std::next(MI)->getOpcode == TargetOpcode::INLINEASM_BR, or !std::next(MI).isTerminator(), though I'm not sure what to do in the latter case.

That would do less work for the general case; the current approach checks the opcode of every MachineInstr that is the source of a LiveInterval, which there are generally a lot. This would only check for the special case when we decided we needed to spill (still occurs frequently for large MachineFunctions, but a lot less than the number of LiveIntervals. Let me see if I can update the patch to do that.

Re: 1: I totally do not understand. :( Are you referring to LiveIntervals::splitSeparateComponents?

This patch is sort of along the lines of (1), but I'm not sure this is enough to make the interval split reliably (in which case, the allocator could potentially run out of registers). I haven't spent enough time with the spill weight code to say for sure.

Is there someone more active in this area you could cc for review? This bug is blocking the use of asm goto w/ outputs in the Linux kernel.

arsenm requested changes to this revision.Apr 10 2020, 4:25 PM

You cannot rely on marking ranges as unspillable. This isn't an option with fast regalloc.

llvm/test/CodeGen/X86/callbr-asm-regalloc.ll
10 ↗(On Diff #256475)

This test is way too complicated. This really needs a MIR test

This revision now requires changes to proceed.Apr 10 2020, 4:25 PM
llvm/test/CodeGen/X86/callbr-asm-regalloc.ll
10 ↗(On Diff #256475)

That's fair. It would be better to more precisely represent and test the issue, which would also aid review. And it's time I learn to write some MIR tests. For dumping MIR, do I just do llc -print-after=finalize-isel and write that to a file? Is there an existing MIR test that you recommend I refer to in order to understand testing for cases where specific registers need to be or not to be spilled?

llvm/test/CodeGen/X86/callbr-asm-regalloc.ll
10 ↗(On Diff #256475)

Nevermind, I found https://llvm.org/docs/MIRLangRef.html#mir-testing-guide. Looks like I have some reading to do.

  • convert .ll test to .mir test, though it fails during YAML parsing...?
  • will simplify test next week, and remove .ll test.

We could check that std::next(MI)->getOpcode == TargetOpcode::INLINEASM_BR, or !std::next(MI).isTerminator(), though I'm not sure what to do in the latter case.

Probably you'd want to check whether the instruction that produces the value is a terminator (which at the moment, should imply it's an INLINEASM_BR).

Re: 1: I totally do not understand. :( Are you referring to LiveIntervals::splitSeparateComponents?

I was going to try to explain a little more, but then I realized I'm not sure I understand how this would work in practice, so probably I just shouldn't try.

Using LiveIntervals in any way is not an option to solve this. They are not available in all fast regalloc and other allocators. The constraints need to be expressed with copies that are terminators, and splitting blocks when necessary

llvm/test/CodeGen/X86/callbr-asm-regalloc.mir
6

huh, that's not yaml...

arsenm added inline comments.Apr 14 2020, 12:30 PM
llvm/test/CodeGen/X86/callbr-asm-regalloc.mir
6

The debug printing format that -print-after gives you is not the same MIR output as -stop-after

  • update MIR test to something that runs, still need to clean it up
  1. Forbid directly spilling registers produced by an INLINEASM_BR; instead, force a live interval split, and then we can spill the split interval.

Using LiveIntervals in any way is not an option to solve this. They are not available in all fast regalloc and other allocators. The constraints need to be expressed with copies that are terminators, and splitting blocks when necessary

Right, but I'll bet that when FastAlloc needs to spill, it also uses InlineSpiller to do so. I want to play with or prototype to see if we can make InlineSpiller not violate MachineVerifier invariants by spilling post terminal instructions.

Some other notes I had jotted down, reviewers please speak up if any of these seem immediately not worth pursuing:

  1. branch-folder should probably assert when removing a MachineBasicBlock that has its address taken. That's twice now we've seen this produce completely garbage code. I don't think we should prevent it, as both times its been a cascading failure, so I think an assert is more appropriate.
  2. assert in inline-spiller if it places a spill after a terminator MachineInstr. This will wind up breaking the MachineVerifier constraint about non terminal instructions post terminators anyways. We might even be able to solve this problem there, maybe.
  3. "canonicalize" the fallthrough case during our block splitting special case in instruction scheduling to put the fallthrough immediately after the INLINEASM_BR's parent MachineBasicBlock (as in this test case, and not any of the existing ones). I suspect this will help reduce overlapping LiveRanges and help reduce register pressure.

And of course https://reviews.llvm.org/D75098.

llvm/test/CodeGen/X86/callbr-asm-regalloc.mir
6

Ok, I think I have it updated. Seems that regalloc is actually a grouping of passes that run, so there's no explicit "Greedy" in -print-after-all and you have to dump the MIR before any of the related regalloc passes (ie. after twoaddressinstruction for x86). MIR/llc is different from LLVM IR/opt in that passes generally are order dependent, so you must dump MIR in the precise form that a given pass expects. Someone please correct me if I'm wrong in my understanding of this point.

I suppose I can first clean up this test by removing the LLVM IR embedded within.

We need a test case for which there are more live registers in use than there are physical GPRs to allocate, and where the register allocator chooses to spill the liveouts from the INLINEASM_BR. @arsenm , I'm not sure how much more I can simplify the test, at least in an automated manner.

llvm/test/CodeGen/X86/callbr-asm-regalloc.mir
6

huh, seems I can't eliminate the LLVM IR from the MIR, as otherwise the MIR becomes invalid due to references of missing globals defined in the LLVM IR. That's uh, kind of a mess, and feels like LLVM IR is not discarded or even discardable when lowered to MIR.

  • drop LLVM IR test, superseded by MIR test.
nickdesaulniers marked 3 inline comments as done.Apr 14 2020, 2:16 PM

branch-folder should probably assert when removing a MachineBasicBlock that has its address taken.

My convcern here is that we aren't very aggressive about dropping blockaddresses at the IR level, so we could end up with an "address-taken" block that doesn't actually have any callbr/indirectbr predecessors.

llvm/test/CodeGen/X86/callbr-asm-regalloc.mir
6

sigh, and this test passes with or without my change to CalcSpillWeights, though llc -O2 -verify-machineinstrs on the .ll fails...I'm missing something that's preventing me from restarting the failure in llc. Will keep digging.

nickdesaulniers marked 6 inline comments as done.Apr 14 2020, 2:40 PM
nickdesaulniers added inline comments.
llvm/test/CodeGen/X86/callbr-asm-regalloc.mir
6

got it, so:

  1. regalloc is a group, you can see delineations of the groups via -verify-machineinstrs, which will run MachineVerifier between groups of passes, not each individual pass. This helped me spot that I needed to -stop-after=machine-scheduler, not -stop-after=twoaddressinstruction.
  2. passing -O2 to llc will influence codegen, and make this particular issue go away.

Posting updated MIR with those 2 corrected.

nickdesaulniers marked an inline comment as done.
  • redump MIR test, use -O2 and -stop-after=instruction-scheduler

For simplifying the test, if you just want to force everything to spill, you can write asm volatile("":::"rax", "rbx", "rcx", "rdx", "rsi", "rdi", "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15");

Using LiveIntervals in any way is not an option to solve this. They are not available in all fast regalloc and other allocators. The constraints need to be expressed with copies that are terminators, and splitting blocks when necessary

As a sanity check, I patched in https://reviews.llvm.org/D75098 and I can still reproduce this failure. I think the problem that TCOPY solves, and the problem I'm trying to solve are orthogonal? So I don't think I'll just be able to solve the case of spilled live outs from INLINEASM_BR via TCOPY.

Ah, I wish I watched https://www.youtube.com/watch?v=hf8kD-eAaxg before looking into this! (Todo list includes https://www.youtube.com/watch?v=ktXFOlyEOtY and https://www.youtube.com/watch?v=IK8TMJf3G6U and https://github.com/nael8r/How-To-Write-An-LLVM-Register-Allocator/blob/master/HowToWriteAnLLVMRegisterAllocator.rst).

I tried canonicallizing the callbr fallthrough, but it makes no difference. I suppose that makes sense, given that the live range doesn't change.

https://reviews.llvm.org/D78586 is also pointing out another issue with callbr return values.

Re: https://reviews.llvm.org/D78586

it looks like livevars marks the physical reg defs in the INLINEASM_BR as dead, because its analysis is intrablock (as stated in a comment at the top of llvm/lib/CodeGen/LiveVariables.cpp), so it doesn't see the uses in the newly split block. I wonder if for INLINEASM_BR, if we extend the analysis so that the split block was analyzed, if this would work? A hacked up patch that marks all registers live post INLINEASM_BR post livevars allows the test case in D78586 to pass, but I haven't looked at the codegen to see if anything terrible happened, or what other invariants were violated. (maybe D78166 or D78234). Actually, generated code looks good. Let me see if that's feasible when cleaned up, tomorrow.

The issue with physical register liveness is what D75098 was trying to solve. The testcase here involves a callbr that produces a virtual register, which is a slightly different problem, I think.

void added a comment.May 2 2020, 5:10 AM

branch-folder should probably assert when removing a MachineBasicBlock that has its address taken.

My convcern here is that we aren't very aggressive about dropping blockaddresses at the IR level, so we could end up with an "address-taken" block that doesn't actually have any callbr/indirectbr predecessors.

If a block has its address taken, whether it's used in a callbr or some other instruction, is it ever safe to remove the block if the blockaddress instruction referencing it still exists?

Also, if some degenerate were to write some heresy like this, would LLVM be able to handle it?

  %ba = blockaddress label %indirect
  store %ba, %some_stack_slot

...

  %ba_reload = load %some_stack_slot
  callbr ... (%ba_reload)

If a block has its address taken, whether it's used in a callbr or some other instruction, is it ever safe to remove the block if the blockaddress instruction referencing it still exists?

The way that indirectbr works is that blockaddress passes out an opaque value, and that address is supposed to be passed to the corresponding indirectbr. It doesn't really matter where the blockaddress is computed; what matters is that there's an indirectbr that branches to the block mentioned in the blockaddress constant.

We don't promise that a blockaddress is usable for any other purpose. If there is no indirectbr to pass the blockaddress to, there isn't anything constraining the opaque value, so it can be transformed to an arbitrary integer.