This is an archive of the discontinued LLVM Phabricator instance.

Add TCOPY, a terminator form of the COPY instr
Needs ReviewPublic

Authored by void on Feb 24 2020, 6:47 PM.

Details

Summary

The INLINEASM_BR instruction's a terminator, and the code generator
doesn't allow non-terminator instructions after a terminator. This is an
issue when an INLINEASM_BR defines a physical register. We can't place
the copies of the physical registers into virtual registers in the
fallthrough block because physical registers can't be marked as
"live-in" until after register allocation.

To get around this issue, we introduce a new pseudo-instruction, TCOPY,
that's identical to the COPY instruction, but is a terminator. With it
we're able to copy the physical registers to virtual registers without
needing to place the copies in a fallthrough block:

bb.1:
  INLINEASM_BR &"" ..., implicit-def $esi, $1:[regdef],...
  %9:gr32 = TCOPY $esi

bb.2:
; predecessors: %bb.1
  successors: %bb.3(0x80000000); %bb.3(100.00%)

  %0:gr32 = COPY %9:gr32
  JMP_1 %bb.3

The TCOPY is converted to a normal COPY after register allocation, when
we have live variable information and live-ins are allowed on basic
blocks.

The fast register allocator behaves a bit differently because everything
is spilled before the end of a basic block. Therefore, we allow a store
of a physical register after the INLINEASM_BR when optimizations are
disabled.

Diff Detail

Event Timeline

void created this revision.Feb 24 2020, 6:47 PM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 24 2020, 6:47 PM
arsenm added a subscriber: arsenm.Feb 24 2020, 7:10 PM

This will be useful for AMDGPU, we currently have a set of _term mov instructions for this purpose.

llvm/include/llvm/Support/TargetOpcodes.def
101

s/terminal/terminator

llvm/include/llvm/Target/Target.td
1124–1125

This shouldn't need isBranch or isIndirectBranch

llvm/lib/CodeGen/SelectionDAG/InstrEmitter.cpp
179

I would expect this to not special case INLINEASM_BR. This should be any value defined by a terminator instruction

arsenm added inline comments.Feb 24 2020, 7:14 PM
llvm/include/llvm/Support/TargetOpcodes.def
101

COPY_TERM or COPY_TERMINATOR would be better since it isn't really a branch

lkail added a subscriber: lkail.Feb 24 2020, 7:40 PM
llvm/include/llvm/CodeGen/MachineInstr.h
1129

Does this save a few getOpcode() calls? The rest of this CL explicitly compares the opcode against COPY_BR? I just worry if this might mess up existing callsites of isCopy when getOpcode() == TargetOpcode::COPY_BR;.

llvm/include/llvm/Target/Target.td
1126

COPY also sets hasNoSchedulingInfo to 0. Should COPY_BR do that as well?

llvm/lib/CodeGen/ExpandPostRAPseudos.cpp
191

what? if we could use a range based for loop for the MBB's, surely we can for MI.

llvm/lib/CodeGen/MachineSink.cpp
1398

Does it make sense to "sink" register info invalidation into performSink()? (pun intended) Since it's already checking the MI's opcode? Or are the two call sites of performSink() problematic? The implementation of invalidateLiveness() looks pretty cheap, IMO.

llvm/lib/CodeGen/MachineVerifier.cpp
849

this is hard to read. Is there a nice way to simply this? Maybe negate, and fold into parent if?

void marked 7 inline comments as done.Feb 25 2020, 1:10 PM
void added inline comments.
llvm/include/llvm/CodeGen/MachineInstr.h
1129

It's more than just a few. isCopy() is used pretty extensively. I want the COPY_BR (or COPY_TERM) to be exactly like a normal COPY, with the only exception that it can come after a terminator.

llvm/include/llvm/Support/TargetOpcodes.def
101

How about TCOPY?

llvm/include/llvm/Target/Target.td
1126

Possibly, though I had trouble finding where to place its scheduling info. It seems like COPY is special and TableGen adds it by default. I couldn't find the place to do that for COPY_BR, but I'll take another look.

void added inline comments.Feb 25 2020, 1:10 PM
llvm/lib/CodeGen/ExpandPostRAPseudos.cpp
191

The ++mi below made me worried, and the fact that MI could be erased.

llvm/lib/CodeGen/MachineSink.cpp
1398

Possibly. If we're going to make COPY_TERM (or whatever name we settle on) not specific to INLINEASM_BR, then we'll probably need to modify this to accept any case where we sink copies after a terminator. We'll come back to this after other comments settle.

llvm/lib/CodeGen/SelectionDAG/InstrEmitter.cpp
179

Does this include unconditional jump terminators?

arsenm added inline comments.Feb 25 2020, 1:16 PM
llvm/include/llvm/Target/Target.td
1126

Last time I looked at adding a copy variant, the right solution was just isAsCheapAsAMove = 1. Otherwise it would have required adding the COPY_BR to every target's COPY handling

llvm/lib/CodeGen/SelectionDAG/InstrEmitter.cpp
179

I'm not sure how you could construct a sensible unconditional jump that would require a copy after it. Seems like something to check in the verifier

nickdesaulniers marked an inline comment as done.Feb 25 2020, 1:39 PM

Also, if INLINEASM_BR is a terminator, but we want TCOPY to be a terminator, should this patch make INLINEASM_BR NOT a terminator? Or would that be a follow up patch after creating TCOPY?

Also, none of the tests have TCOPY (or w/e). I feel like any patch adding a new instructions at w/e level of IR should have some test that the instruction exists.

Also, I don't quite follow what the changes to the tests are testing for with this change.

llvm/lib/CodeGen/ExpandPostRAPseudos.cpp
191

Ah sorry, I missed that. I'm not sure what's meant by erased, but I'd be curious if a continue could not be used instead of manual iterator advancement.

void marked an inline comment as done.Feb 25 2020, 3:07 PM

Also, if INLINEASM_BR is a terminator, but we want TCOPY to be a terminator, should this patch make INLINEASM_BR NOT a terminator? Or would that be a follow up patch after creating TCOPY?

INLINEASM_BR being a terminator makes a lot of things "just work" for the code generator. I'm not 100% convinced that it *should* be a terminator, but if we decided to make it a normal instruction, then I think we'd have to touch way too much code to make it worthwhile.

Also, none of the tests have TCOPY (or w/e). I feel like any patch adding a new instructions at w/e level of IR should have some test that the instruction exists.

The tests are the current "asm goto with outputs" tests.

Also, I don't quite follow what the changes to the tests are testing for with this change.

The test changes are testing that we can compile at -O0.

void updated this revision to Diff 246611.Feb 25 2020, 5:48 PM

Don't mark TCOPY as a branch. Also s/COPY_BR/TCOPY/.

void retitled this revision from Add COPY_BR, a terminator form of the COPY instr to Add TCOPY, a terminator form of the COPY instr.Feb 25 2020, 5:52 PM
void edited the summary of this revision. (Show Details)
void marked 5 inline comments as done.Feb 25 2020, 5:56 PM
void added inline comments.
llvm/lib/CodeGen/SelectionDAG/InstrEmitter.cpp
179

My comment wasn't very good. When converting something to a TCOPY, we need to figure out which terminators are candidates for having a TCOPY after them. Or perhaps better, what condition requires a TCOPY instead of a regular COPY.

void updated this revision to Diff 247090.Feb 27 2020, 1:16 PM
void edited the summary of this revision. (Show Details)

Liveness is handled by the PostRA machine sink pass, so there's no need to
invalidate it. Also simplify the logic in the machine verifier that allows
stack dumps after a terminator.

llvm/lib/CodeGen/MachineSink.cpp
845

pass as const&, or make performSink a private method.

void updated this revision to Diff 247103.Feb 27 2020, 1:55 PM

Fix accidental change inclusion in the verifier.

void marked an inline comment as done.Feb 27 2020, 2:35 PM
void added inline comments.
llvm/lib/CodeGen/MachineSink.cpp
845

Two different passes use this function so it can't be made a private method. I"m not sure why passing it as "const&" is better than a "const*"...

void updated this revision to Diff 247115.Feb 27 2020, 2:58 PM

Reformat and satisfy clang-tidy.

llvm/lib/CodeGen/MachineSink.cpp
845

Generally, passing by const reference indicates that a parameter is strictly an input, as opposed to both input AND output, which is why you don't see a mix of pointers and references in this function signature.

void marked an inline comment as done.Feb 27 2020, 4:02 PM
void added inline comments.
llvm/lib/CodeGen/MachineSink.cpp
845

A 'const*' doesn't allow modifications either. Also note that none of the references in this function signature are 'const', and are probably passed by reference because they aren't pointers (SuccToSinkto notwithstanding) in the originating function. Converting this to a reference is not useful.

arsenm added inline comments.Feb 28 2020, 7:52 AM
llvm/include/llvm/CodeGen/MachineInstr.h
1129

I think this is potentially hazardous. I'm worried about the peephole optimizer doing things like folding a copy into a tcopy and losing the terminator bit. Can you add some testcases with coalescable tcopy pairs for PeepholeOptimizer and the register coalescer?

void updated this revision to Diff 247719.Mar 2 2020, 1:26 PM
void marked an inline comment as done.
  • Add a verification step to ensure a COPY doesn't follow a TCOPY.
  • Comment that the predicate for determining whether to use a TCOPY or COPY should be improved, but at this time we don't have enough information to determine the best criteria.
void added inline comments.Mar 2 2020, 3:26 PM
llvm/include/llvm/CodeGen/MachineInstr.h
1129

I'll craft some tests, though note that TCOPY should only be after a terminator, so a COPY should never be merged with it. (I'll add a verifier check to ensure that TCOPY doesn't happen before a terminator.)

void updated this revision to Diff 249489.Mar 10 2020, 1:37 PM

Add TCOPY to some switch statements that handle COPY

void marked an inline comment as done.Mar 10 2020, 4:59 PM
void added inline comments.
llvm/include/llvm/CodeGen/MachineInstr.h
1129

@arsenm I'm struggling a bit to come up with tests that will exercise the peephole optimizer and register coalescer. Do you have any advice on how to do this?

void updated this revision to Diff 250443.Mar 15 2020, 3:14 PM

Rebase and update testcase.

void added a comment.Mar 15 2020, 3:15 PM

Friendly ping. :-)

void added a comment.Mar 18 2020, 3:03 PM

Another friendly ping. :-)

arsenm added inline comments.Mar 30 2020, 4:26 PM
llvm/include/llvm/CodeGen/MachineInstr.h
1129

You shouldn't need a special verifier check, it should be caught by the normal terminator before non-terminator check.

I mean something like
%1 = COPY %0
%2 = TCOPY %1

and then maybe mix in some subregistsers? The worry is the COPY will somehow replace the TCOPY

void marked an inline comment as done.May 4 2020, 3:12 AM
void added inline comments.
llvm/include/llvm/CodeGen/MachineInstr.h
1129

This isn't an issue as far as I can see (at least the kind of example you're referring to). A TCOPY can become a COPY if appropriate. The only time it can't become a COPY is if it occurs after a terminator, in which case it should be caught by current verifier checks. I *don't* think a COPY should become a TCOPY though.

void updated this revision to Diff 261771.May 4 2020, 3:15 AM
  • Add support to MIR parsing.
  • Rebase current change.
arsenm added inline comments.May 4 2020, 7:43 AM
llvm/lib/CodeGen/MIRParser/MIRParser.cpp
312–342 ↗(On Diff #261771)

This is an unrelated change

llvm/lib/CodeGen/MachineSink.cpp
861–864

I don't see this captured in a test?

llvm/lib/CodeGen/MachineVerifier.cpp
629–630

This also looks like a separate change

qcolombet requested changes to this revision.May 4 2020, 9:56 AM

I don't think introducing the TCOPY opcode is a valid solution.
Let's assume the copy doesn't get coalesced, that means it needs to be executed after the inlineasm branch. And this is not going to happen if it sits in the same basic block right after the jump.

What happens if you put the copy at the beginning of each successor?
You mentioned that this is not possible because the live-ins wouldn't be set, but I would expect that this information would be correctly computed when building the liveness information. I.e., I don't expect this to be a problem.

we allow a store of a physical register after the INLINEASM_BR when optimizations are disabled.

That sounds wrong.

Could we step back a little bit, what is the semantic of INLINEASM_BR?

This revision now requires changes to proceed.May 4 2020, 9:56 AM
void marked an inline comment as done.May 4 2020, 3:32 PM

I don't think introducing the TCOPY opcode is a valid solution.
Let's assume the copy doesn't get coalesced, that means it needs to be executed after the inlineasm branch. And this is not going to happen if it sits in the same basic block right after the jump.

What happens if you put the copy at the beginning of each successor?
You mentioned that this is not possible because the live-ins wouldn't be set, but I would expect that this information would be correctly computed when building the liveness information. I.e., I don't expect this to be a problem.

we allow a store of a physical register after the INLINEASM_BR when optimizations are disabled.

That sounds wrong.

Could we step back a little bit, what is the semantic of INLINEASM_BR?

[I'm sorry if some of this is basic. I just want to give a clear description of what's going on and why.]

INLINEASM_BR is an attempt to replicate the behavior of "asm goto". As such, it has one default destination (more-or-less the fallthrough block), and one or more indirect destinations. If the indirect destinations aren't taken, the resulting control flow is to fall out of the bottom of the ASM block. The callbr instruction is a terminator so that it can model this behavior as best as it can (which isn't all that great with LLVM's IR, but doable (see "invoke")). Once it's converted to INLINEASM_BR the behavior obviously doesn't change, but now that we have outputs, we need to figure out how to handle them given the constraints of MIR (i.e. you can't have non-terminals after a terminal, and live-ins are only expected to be correct after register allocation).

Because of how callbr is modeled, INLINEASM_BR is also a terminator. So during ISEL the values that are defined by INLINEASM_BR need to be spilled before jumping to the default block. Because of the constraint against having non-terminators after a terminator, we artificially modify the code after ISEL's finished building the block, placing any spills into a "copy" block (that's between the INLINEASM_BR and its default destination), and adding any registers defined by INLINEASM_BR as "live-ins" to the copy block. This is bad, because it violates the constraints, but was necessary at the time because we had no way to avoid it.

[Note that the outputs from an INLINEASM_BR are valid only on the default branch. It's very difficult to have them on the indirect branches without making the resulting code bad---both in style and in performance.]

Okay, so that's the behavior of INLINEASM_BR and some of the issues we came across that led us to here. I think TCOPY is a good solution because the default behavior of INLINEASM_BR is to fall out the bottom of the ASM block, so it would "naturally" execute the TCOPY instructions. And since we only ensure that the values are valid on the default branch, the COPYs will always be executed when needed (note a block containing an INLINEASM_BR should either exit to the default via a fallthrough or unconditional jump). It's still possible to split the block directly after the INLINEASM_BR and convert the TCOPYs into regular COPYs, in fact I would encourage us to do that at the appropriate point. It could remove a lot of the code in ScheduleDAGSDNodes.cpp.

What happens if you put the copy at the beginning of each successor?

I found that this didn't work. CodeGen wants registers to be spilled at the end of a block. The spill slot is recorded and used later on when accessing the value. If we just add a COPY into the default block it won't have the necessary information available to load the correct value. We would have to add the information as a "live-in" on the default block, but that's a constraint violation, wash-rinse-repeat.

Thanks Bill for the detailed explanations. A couple more questions.

CodeGen wants registers to be spilled at the end of a block.

That's fast reg alloc only. What is happening with the other allocators?

The spill slot is recorded and used later on when accessing the value. If we just add a COPY into the default block it won't have the necessary information available to load the correct value.

Why is that? (Why doesn't it have the necessary information to load the correct value.)

I feel that we are trying to work around a bug/limitation in the fast register allocator where physreg are assumed not to cross basic block boundaries. I wonder if it wouldn't be easier to fix fast reg alloc if that is the problem.

Out-of-curiosity, how do we deal with invoke? I would expect we have the exact same problems.

void added a comment.May 7 2020, 7:10 PM

Thanks Bill for the detailed explanations. A couple more questions.

CodeGen wants registers to be spilled at the end of a block.

That's fast reg alloc only. What is happening with the other allocators?

They don't appear to use the correct registers for the defined values.

The spill slot is recorded and used later on when accessing the value. If we just add a COPY into the default block it won't have the necessary information available to load the correct value.

Why is that? (Why doesn't it have the necessary information to load the correct value.)

I feel that we are trying to work around a bug/limitation in the fast register allocator where physreg are assumed not to cross basic block boundaries. I wonder if it wouldn't be easier to fix fast reg alloc if that is the problem.

It's not so much working around a bug/limitation in any of the register allocators. What I'm most concerned about is having to specify live-ins to MBBs before register allocation. That invariant was a huge sticking point in the review process for this feature. We're currently violating it because it "seems to work", but I don't want to rely on that.

Because of the live-ins restriction, all live physical registers at the end of an MBB need to be "spilled", or at least recorded in a way so that other blocks can access the information without having their live-ins set. (This is my understanding, it may be inaccurate). The SelectionDAGBuilder::CopyToExportRegsIfNeeded() function is one of the ways this is done.

Out-of-curiosity, how do we deal with invoke? I would expect we have the exact same problems.

Invoke is turned into a regular call with exception handling stuff around it. E.g.:

$ cat ex.cpp:
void g();
int f() {
  try {
    g();
  } catch (int&) {
    return 37;
  }
  return 1;
}

$ clang++ -mllvm -print-after-all ex.cpp -c -o /dev/null
...
bb.0 (%ir-block.0):
  successors: %bb.1, %bb.2

  EH_LABEL <mcsymbol .Ltmp0>
  ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
  CALL64pcrel32 @_Z1gv, <regmask $bh $bl $bp $bph $bpl $bx $ebp $ebx $hbp $hbx $rbp $rbx $r12 $r13 $r14 $r15 $r12b $r13b $r14b $r15b $r12bh $r13bh $r14bh $r15bh $r12d $r13d $r14d $r15d $r12w $r13w $r14w $r15w $r12wh and 3 more...>, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp
  ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
  EH_LABEL <mcsymbol .Ltmp1>
  JMP_1 %bb.1
...
void updated this revision to Diff 262856.May 8 2020, 4:34 AM

Remove some of the hackery in the SDNode scheduler that splits below an
INLINEASM_BR instruction. It's not needed with TCOPY.

void updated this revision to Diff 262957.May 8 2020, 2:07 PM

Allow a TCOPY to sink, since it's exactly like a COPY.

void marked an inline comment as done.May 8 2020, 4:36 PM
void added inline comments.
llvm/lib/CodeGen/MachineSink.cpp
861–864

This will happen now that we correctly mark TCOPY as sinkable. I'll see if I can craft an MIR test to explicitly do this.

void updated this revision to Diff 263000.EditedMay 9 2020, 1:49 AM

Spill TCOPY in FastRA right before the TCOPY instr instead of before the
terminators.

void updated this revision to Diff 263010.May 9 2020, 3:33 AM

Fix formatting and clang-tidy warnings.

void updated this revision to Diff 263063.May 10 2020, 3:05 AM

Revert bad formatting that sneaked (snuck?) in.

void added a comment.May 10 2020, 11:36 PM

Thanks Bill for the detailed explanations. A couple more questions.

CodeGen wants registers to be spilled at the end of a block.

That's fast reg alloc only. What is happening with the other allocators?

They don't appear to use the correct registers for the defined values.

I should probably explain a bit better. With a normal INLINEASM instruction, any defined registers are spilled directly afterwards. This is one reason why I want to do the same for INLINEASM_BR with the TCOPY. There could be other ways to achieve the same thing, but I think it would involve placing a larger burden on the code generation passes than needs be.

I think the better fix here is to make INLINEASM_BR _not_ be a terminator -- just like the CALL involved in an invoke is not.

That change turned out to be a bit complicated, and while I had started taking a stab at that a couple months ago, I got distracted by many other things going on...However, I've now dusted off that experiment and got it into a useful state, and will send out a review for this tomorrow, so we can consider both the options.

void added a comment.May 12 2020, 6:40 PM

I think the better fix here is to make INLINEASM_BR _not_ be a terminator -- just like the CALL involved in an invoke is not.

That change turned out to be a bit complicated, and while I had started taking a stab at that a couple months ago, I got distracted by many other things going on...However, I've now dusted off that experiment and got it into a useful state, and will send out a review for this tomorrow, so we can consider both the options.

I've thought about the same thing. I'm right now running up against an issue that Nick pointed out. Having a non-branch as a terminator works until the register allocator wants to spill. Then it could insert a copy where the TCOPY is, which messes up analyzeBranch. I'm interested to see your patch.

I think the better fix here is to make INLINEASM_BR _not_ be a terminator -- just like the CALL involved in an invoke is not.

That change turned out to be a bit complicated, and while I had started taking a stab at that a couple months ago, I got distracted by many other things going on...However, I've now dusted off that experiment and got it into a useful state, and will send out a review for this tomorrow, so we can consider both the options.

I think the better fix here is to make INLINEASM_BR _not_ be a terminator -- just like the CALL involved in an invoke is not.

That change turned out to be a bit complicated, and while I had started taking a stab at that a couple months ago, I got distracted by many other things going on...However, I've now dusted off that experiment and got it into a useful state, and will send out a review for this tomorrow, so we can consider both the options.

I've thought about the same thing. I'm right now running up against an issue that Nick pointed out. Having a non-branch as a terminator works until the register allocator wants to spill. Then it could insert a copy where the TCOPY is, which messes up analyzeBranch. I'm interested to see your patch.

Regardless of what happens to INLINEASM_BR, I think we still need a terminator copy

Regardless of what happens to INLINEASM_BR, I think we still need a terminator copy

What for?
The whole concept is bogus to me because if the previous instructions are really terminators (as in they branch to somewhere else), then this copy would never be executed.

Regardless of what happens to INLINEASM_BR, I think we still need a terminator copy

What for?
The whole concept is bogus to me because if the previous instructions are really terminators (as in they branch to somewhere else), then this copy would never be executed.

Terminator does not imply a branch, it only means glued to the end of the block. It's the only way to get correct spill code placement around exec mask writes. We need a copy to the exec mask, but spill code needs to be inserted before that point. Currently we get unnecessary copies since we use a raw move instructions for this purpose, when we could have coalesced a copy

void added a comment.May 28 2020, 3:38 AM

Regardless of what happens to INLINEASM_BR, I think we still need a terminator copy

What for?
The whole concept is bogus to me because if the previous instructions are really terminators (as in they branch to somewhere else), then this copy would never be executed.

That's not necessarily so, because you can have multiple branching terminators then there must be some that branch conditionally (allowing other code to exist between the jumps), otherwise there wouldn't be a need for multiple terminators.

What's the status of this? I'm still interested in terminator copies

void added a comment.Jul 23 2020, 12:19 PM

What's the status of this? I'm still interested in terminator copies

When we last left our adventurers, there was an issue with live range splitting. In some cases it could emit a spill after terminators, which failed validation.