Page MenuHomePhabricator

[IR] add new callbrpad instruction
AbandonedPublic

Authored by nickdesaulniers on Dec 7 2022, 11:17 AM.

Details

Summary

In order to properly support outputs along indirect edges of callbr, it
would be helpful to have distinct SSA values to refer to so that during
instruction selection, we can mark outputs that might come from specific
physregs as live-in and place copies to virtregs.

Please see the corresponding RFC accompanying this patch.

Link: https://github.com/llvm/llvm-project/issues/53562
Link: https://discourse.llvm.org/t/rfc-syncing-asm-goto-with-outputs-with-gcc/65453
Link: https://discourse.llvm.org/t/rfc-asm-goto-with-output-constraints/52354/4

Diff Detail

Event Timeline

Herald added a project: Restricted Project. · View Herald Transcript
nickdesaulniers requested review of this revision.Dec 7 2022, 11:17 AM
Herald added a project: Restricted Project. · View Herald TranscriptDec 7 2022, 11:17 AM

Missing the LangRef and release notes parts

It's not clear to me an IR instruction is necessary. I mean, I understand why you want a callbrpad-equivalent MachineInstr: the only existing infrastructure for jumping out of the middle of a block is the exception-handling infrastructure, so you want to handle callbr in a similar way. I guess if you have a MachineInstr, but you don't have an IR instruction, doing the translation would involve SSA construction in isel, which is maybe a bit awkward. But I'm not sure that's sufficient reason to introduce an IR instruction. At the IR level, callbr is a terminator, and we produce the same value along all edges, so IR transforms don't benefit from distinguishing between direct and indirect edges.

If we do decide to go with an IR instruction, making the callbrpad optional seems likely to cause confusion in IR passes, in terms of what transformations are actually legal. To make this formulation work, indirect destinations have to be distinct from the direct destination. (Otherwise, it's not clear whether values produced directly are actually available in a given block.) If callbrpad is mandatory, this falls out in an obvious way: the direct destination can't contain a callbrpad, and the indirect destinations must contain one, so they can't get folded together. If it's optional, we need extra checks in a bunch of places to make sure the destinations don't get folded together.

llvm/docs/LangRef.rst
12159

Having the callbr as an operand seems redundant. We can already look up the corresponding callbr by just grabbing the terminator of the predecessor.

llvm/test/Verifier/callbr.ll
74

Using CHECK-NOT like this is really fragile. I'd suggest having two tests: one that tests that valid IR isn't rejected, and one that checks that the expected diagnostics are produced for invalid IR.

It's not clear to me an IR instruction is necessary. I mean, I understand why you want a callbrpad-equivalent MachineInstr: the only existing infrastructure for jumping out of the middle of a block is the exception-handling infrastructure, so you want to handle callbr in a similar way. I guess if you have a MachineInstr, but you don't have an IR instruction, doing the translation would involve SSA construction in isel, which is maybe a bit awkward. But I'm not sure that's sufficient reason to introduce an IR instruction. At the IR level, callbr is a terminator, and we produce the same value along all edges, so IR transforms don't benefit from distinguishing between direct and indirect edges.

Consider the following control flow pattern psuedocode, without callbrpad.

  %0 = callbr i32 ... to %direct [%indirect]
direct:
  br label %direct_out
direct_out:
  ret i32 %0
indirect:
  br label %indirect_out
indirect_out:
  ret i32 %0

Where would you insert the potentially-necessary COPYs? Which MBBs would you mark as having physreg liveins? SelectionDAG thinks the value of the callbr is the first virtreg COPY after the INLINEASM_BR, which is incorrect for output branches. SelectionDAG has a 1:1 mapping of Value* to virtual register (FuncInfo.ValueMap used in this patch). @jyknight suggested it might be nice to have distinct SSA values for each indirect edge.

If we do decide to go with an IR instruction, making the callbrpad optional seems likely to cause confusion in IR passes, in terms of what transformations are actually legal. To make this formulation work, indirect destinations have to be distinct from the direct destination. (Otherwise, it's not clear whether values produced directly are actually available in a given block.) If callbrpad is mandatory, this falls out in an obvious way: the direct destination can't contain a callbrpad, and the indirect destinations must contain one, so they can't get folded together. If it's optional, we need extra checks in a bunch of places to make sure the destinations don't get folded together.

See @test5 added in llvm/test/CodeGen/X86/callbr-asm-outputs-indirect-isel.ll, which is a case of common direct and indirect destination. We wind up with superfluous copies in the parent MBB, but the correct registers from expanding the callbrpad. It's not the most optimal code immediately after isel, but it looks correct to me, and those dead copies do get wiped out by later passes.

nickdesaulniers added inline comments.Dec 7 2022, 2:09 PM
llvm/docs/LangRef.rst
12159

My patch relies heavily on the fact that given a callbr, what was the first virtual register assigned as its output. FuncInfo.ValueMap is that mapping. Without any explicit users of a value in IR, SelectionDAG does not retain the mapping for that value.

SelectionDAG also doesn't contain a mapping of callbr Instruction to INLINEASM_BR MachineInstr.

So if you query FuncInfo.ValueMap for the callbr (whether it's an explicit operand or not where you look at the terminator of your predecessor), you will get either the first virtual register after the INLINEASM_BR, or nothing.

It's not clear to me an IR instruction is necessary. I mean, I understand why you want a callbrpad-equivalent MachineInstr: the only existing infrastructure for jumping out of the middle of a block is the exception-handling infrastructure, so you want to handle callbr in a similar way. I guess if you have a MachineInstr, but you don't have an IR instruction, doing the translation would involve SSA construction in isel, which is maybe a bit awkward. But I'm not sure that's sufficient reason to introduce an IR instruction. At the IR level, callbr is a terminator, and we produce the same value along all edges, so IR transforms don't benefit from distinguishing between direct and indirect edges.

Consider the following control flow pattern psuedocode, without callbrpad.

  %0 = callbr i32 ... to %direct [%indirect]
direct:
  br label %direct_out
direct_out:
  ret i32 %0
indirect:
  br label %indirect_out
indirect_out:
  ret i32 %0

Where would you insert the potentially-necessary COPYs? Which MBBs would you mark as having physreg liveins? SelectionDAG thinks the value of the callbr is the first virtreg COPY after the INLINEASM_BR, which is incorrect for output branches. SelectionDAG has a 1:1 mapping of Value* to virtual register (FuncInfo.ValueMap used in this patch). @jyknight suggested it might be nice to have distinct SSA values for each indirect edge.

To expand on this; we might like to conceptually have one value %0 in llvm ir, but in MIR, we'd need either a distinct virtreg per edge (or multiple defs of the same virt reg, which I _think_ MIR might actually support).

As a general rule, I prioritize making the IR less awkward to deal with, vs. making things easier for SelectionDAG.

To expand on this; we might like to conceptually have one value %0 in llvm ir, but in MIR, we'd need either a distinct virtreg per edge

Yes, it would require some sort of translation.

See @test5 added in llvm/test/CodeGen/X86/callbr-asm-outputs-indirect-isel.ll, which is a case of common direct and indirect destination. We wind up with superfluous copies in the parent MBB, but the correct registers from expanding the callbrpad. It's not the most optimal code immediately after isel, but it looks correct to me, and those dead copies do get wiped out by later passes.

I suspect it doesn't really work in general, but I'm not sure off the top of my head what actually breaks. It's hard to reason about, in any case.

we'd need either ... or multiple defs of the same virt reg, which I _think_ MIR might actually support.

I tried this quickly (via llc -stop-after=finalize-isel; <hand modify mir>; llc -start-after=finalize-isel;. Looks like this triggers an assertion immediately:
-mtriple=aarch64:

AArch64StackTaggingPreRA::runOnMachineFunction(llvm::MachineFunction &): Assertion `MRI->isSSA()' failed.

-mtriple=x86_64:

MachineFunctionProperties required by Machine Common Subexpression Elimination pass are not met by function test5.

So it's not legal for selectiondag to produce multiple definitions of the same virt reg on different paths. (I guess I'm not really sure why MachineInstr::defs() returns an iterator then, implying there's possibly more than one def of a virtreg).

I haven't fully caught up here, but, riffing on efriedma's comments --

Are there IR-level optimizations that will cause a problem without there being separate SSA values here? Or would it work to have a codegen pipeline pre-isel lowering that adds the callbrpad instructions as necessary? That is, make this an internal implementation detail of the IR->ISel lowering -- even though we'd actually be adding at the IR level first, for internal implementability reasons.

I haven't fully caught up here, but, riffing on efriedma's comments --

Are there IR-level optimizations that will cause a problem without there being separate SSA values here?

I can't recall any.

Or would it work to have a codegen pipeline pre-isel lowering that adds the callbrpad instructions as necessary? That is, make this an internal implementation detail of the IR->ISel lowering -- even though we'd actually be adding at the IR level first, for internal implementability reasons.

Would this be part of SelectionDAGIsel, or a distinct pass that's part of codgen ( TargetPassConfig::addISelPasses)? (That might be a nice place to do critical edge splitting, if necessary (see also: D138078)) Would we produce callbrpad instructions there, or not?

If we do decide to go with an IR instruction, making the callbrpad optional seems likely to cause confusion in IR passes, in terms of what transformations are actually legal.

So I think this is why @void suggested I create a new instruction, rather than use an intrinsic (I have another impl of this patch using an intrinsic); by marking that instruction isEHPad most transforms know already not to touch.

nickdesaulniers added a comment.EditedDec 7 2022, 11:54 PM

Ok, new idea for an approach based on feedback thus far (comments welcome and appreciated):

  1. allow the use of callbr values along ANY edge; direct or indirect. i.e. restore https://reviews.llvm.org/D135997.
  2. Create a new function pass (as part of TargetPassConfig::addISelPasses so it runs before isel on LLVM IR) that:
    1. for each basic block terminator if it's a callbr that has outputs that have uses:
      1. splits indirect edges that happen to also be critical edges (if any) (akin to https://reviews.llvm.org/D138078, but in a dedicated pass, not during ISEL but just before as part of the TargetPassConfig::addISelPasses pipeline)
      2. inserts a new SSA value produced from either a new instruction, intrinsic call, or perhaps even just a single entry phi. This will be used a marker during SelectionDAGBuilder to denote where physreg to virtreg copies need to go. It need not reference the corresponding callbr since we split critical edges in 2.A.a previously, so it's unambiguous that our corresponding callbr is the terminator of my lone predecessor. (This will cause SelectionDAGBuilder's FunctionLoweringInfo to no longer retain the callbr value to virtreg mapping if there are no other users of the callbr value along the direct edge, but this is ok; see 3 below).
      3. Replaces the uses dominated by the indirect edge to use the newly created value. We might need to insert phis where previously there were none (perhaps the SSA construction @efriedma refers to?).
  3. Add a mapping from CallBrInst to either INLINEASM_BR MachineInstr or first defined virtreg (not sure yet), probably in SelectionDAGBuilder's FunctionLoweringInfo instance member. Create an entry in this mapping when lowering a callbr or inlineasm. Consume this mapping info when visiting the "new value" created from 2.A.b above to insert COPYs as necessary.

By having a dedicated pass, we can better test these transforms in isolation rather than as part of SelectionDAGISEL or SelectionDAGBuilder, which can help us be more confident in the correctness. This approach will create new SSA values along each indirect edge (as per @jyknight ), should not be too much overhead for programs that don't use callbr since we only need to check whether each basic block's terminator is a callbr (as per @void ). Then we do some SSA construction without new instructions or IR special cases (as per @efriedma ). This should remove the special cases for callbr we have when computing dominance as well.

Having a dedicated pass will help us test oddities like a diamond control flow pattern, with the def via callbr in the top, then use in the bottom (with or without phi). Without a phi will need one inserted with the new SSA value as one potential value. With a phi we'll need to update the correct branch+value pair.

Thoughts?

nhaehnle added a subscriber: nhaehnle.EditedDec 8 2022, 11:17 AM

From an LLVM IR perspective, it seems there is inherently the issue that whatever code the callbr invokes may produce values that are available in all successors, but it may also produce values that are only available in some subset of successors. In the spirit of keeping LLVM IR simple, can we solve that issue in a way that isn't too focused on the details of inline assembly?

How about:

  • The value defined by the callbr is accessible to all successors, i.e. all successors of a callbr are equal and there's no special treatment in the dominator tree and/or SSA checking.
  • For successors that do get additional values that are not universally available, callbrpad can be used. Again, all successors of a callbr are equal, and the "direct" successor may have a callbrpad.

An observation that may help with the design is that if we had basic block arguments, then the callbrpad would just be another argument next to potential phis. This makes me dislike the name callbrpad as the term "pad" is slightly vague and seems to suggest something more complex going on. How about callbrdef or maybe even just terminatordef? The point is that the instruction in reality merely represents a value that is defined by the terminator of the predecessor.

(Side note: I don't understand why the diff puts so much emphasis on physical registers in the description of callbrpad. I would expect the issue to exist regardless of how the inlineasm constrains the output.)

Thanks for the feedback!

but it may also produce values that are only available in some subset of successors.

Not precisely. All output values should be available in all successors as a result of this feature.

and the "direct" successor may have a callbrpad.

Yes, it would be possible to require all edges have a landing pad, regardless of direct vs indirect. Perhaps an ergonomics decision.

This makes me dislike the name callbrpad

My heart isn't set on the name, but it sounds like @efriedma thinks that a new instruction is too heavyweight a solution to this problem. I look forward to @efriedma 's thoughts on my latest idea. https://reviews.llvm.org/D139565#3980574 before I proceed further with ANY approach.

(Side note: I don't understand why the diff puts so much emphasis on physical registers in the description of callbrpad. I would expect the issue to exist regardless of how the inlineasm constrains the output.)

I tried to allude to this in the RFC but TL;DR:

  1. MIR doesn't allow for phi's to have physregs, only virtregs. We need to COPY physregs (which the inline asm constraints might explicitly require) to virtregs then use those in phis.
  2. It's possible to express two callbr's with mutually exclusive phyreg constraints for the same output variable which have the same indirect destination. So we must have a phi, which can't have physregs, so we need copies. Those copies need to go somewhere, and it's not correct/appropriate to put them in the predecessor or successor for such critical edges.

Ok, new idea for an approach based on feedback thus far (comments welcome and appreciated):

  1. allow the use of callbr values along ANY edge; direct or indirect. i.e. restore https://reviews.llvm.org/D135997.

Right.

  1. Create a new function pass (as part of TargetPassConfig::addISelPasses so it runs before isel on LLVM IR) that:

If it isn't practical to lower correctly in SelectionDAG without mutating the IR, that's fine. I suspect GlobalISel doesn't want an IR pass, though; it's probably more straightforward there to lower callbr directly to a G_CALLBR instruction, then do the lowering directly on MIR . So making it part of SelectionDAG isel might be better? I guess GlobalISel can just ignore the intrinsic if it wants, though, so maybe a separate pass is fine. (We can add a flag to dump the IR after the transform if you're worried about testing.)

If you do make it a separate pass, you probably want it pretty close to isel so other transforms don't get confused.

  1. for each basic block terminator if it's a callbr that has outputs that have uses:
    1. splits indirect edges that happen to also be critical edges (if any) (akin to https://reviews.llvm.org/D138078, but in a dedicated pass, not during ISEL but just before as part of the TargetPassConfig::addISelPasses pipeline)

Right, you need to split to insert the copies in the right place, in general.

  1. Replaces the uses dominated by the indirect edge to use the newly created value. We might need to insert phis where previously there were none (perhaps the SSA construction @efriedma refers to?).

Right, this is what I was referring to when I mentioned SSA construction. (You'd probably use the llvm::SSAUpdater utility on IR.)

  1. Add a mapping from CallBrInst to either INLINEASM_BR MachineInstr or first defined virtreg (not sure yet), probably in SelectionDAGBuilder's FunctionLoweringInfo instance member. Create an entry in this mapping when lowering a callbr or inlineasm. Consume this mapping info when visiting the "new value" created from 2.A.b above to insert COPYs as necessary.

Not sure you're guaranteed to visit the callbr before its successors?

nickdesaulniers added a comment.EditedDec 8 2022, 2:21 PM

Thanks for the feedback. It sounds like the approach I outlined isn't too unreasonable?

  1. Create a new function pass (as part of TargetPassConfig::addISelPasses so it runs before isel on LLVM IR) that:

If it isn't practical to lower correctly in SelectionDAG without mutating the IR, that's fine. I suspect GlobalISel doesn't want an IR pass, though; it's probably more straightforward there to lower callbr directly to a G_CALLBR instruction, then do the lowering directly on MIR . So making it part of SelectionDAG isel might be better? I guess GlobalISel can just ignore the intrinsic if it wants, though, so maybe a separate pass is fine. (We can add a flag to dump the IR after the transform if you're worried about testing.)

If you do make it a separate pass, you probably want it pretty close to isel so other transforms don't get confused.

GlobalISel cannot lower callbr today. See IRTranslator::translateCallBr. Is GlobalISel "driven" by SelectionDAGISel, (like SelectionDAG is?) (Sorry, my understanding of the instruction selection frameworks in LLVM is...nascent and partial).

I think perhaps it makes sense to create transform machinery distinct from a pass. That way we can start with a distinct IR pass prior to SelectionDAG (as part of the TargetPassConfig::addISelPasses group) which uses said transform machinery and is testable, then transition/reuse it from elsewhere if necessary if/when we implement IRTranslator::translateCallBr in GlobalISEL.

EDIT: I'm quite sure that TargetPassConfig::addISelPasses are run prior to whatever instruction selection framework is ultimately used.

  1. Replaces the uses dominated by the indirect edge to use the newly created value. We might need to insert phis where previously there were none (perhaps the SSA construction @efriedma refers to?).

Right, this is what I was referring to when I mentioned SSA construction. (You'd probably use the llvm::SSAUpdater utility on IR.)

oh, thank god that exists and thanks for pointing me to it! I was losing sleep last night thinking about how difficult this would be for me to do...that class looks very very helpful for this problem.

  1. Add a mapping from CallBrInst to either INLINEASM_BR MachineInstr or first defined virtreg (not sure yet), probably in SelectionDAGBuilder's FunctionLoweringInfo instance member. Create an entry in this mapping when lowering a callbr or inlineasm. Consume this mapping info when visiting the "new value" created from 2.A.b above to insert COPYs as necessary.

Not sure you're guaranteed to visit the callbr before its successors?

I'm not certain either; it looks like SelectionDAGISel::SelectAllBasicBlocks does a ReversePostOrderTraversal of the basic blocks of a function. I think that means we visit the callbr before its successors, but I'm not sure what that means when we have a callbr loop back on itself (such that the basic block is both its own successor and predecessor, maybe edge splitting can help; EDIT: pretty sure that's just another critical edge we can split).

b. inserts a new SSA value produced from either a new instruction, intrinsic call, or perhaps even just a single entry phi.

Any thoughts on what this "token" should be? i.e. a new instruction, call to a new intrinsic, one entry phi with metadata, <other>?

I think perhaps it makes sense to create transform machinery distinct from a pass.

Sure.

GlobalISel cannot lower callbr today. See IRTranslator::translateCallBr. Is GlobalISel "driven" by SelectionDAGISel, (like SelectionDAG is?)

GlobalISel is completely separate; it doesn't use SelectionDAG infrastructure at all. IRTranslator generates MachineInstrs using G_* instruction, then successive passes optimize and lower those instructions into native opcodes. So unlike SelectionDAG, you can conveniently do cross-block transforms like splitting edges. (There's a fallback mechanism, but that just basically invokes SelectionDAG from scratch for the whole function.)

I'm not certain either; it looks like SelectionDAGISel::SelectAllBasicBlocks does a ReversePostOrderTraversal of the basic blocks of a function.

The point of this digression is just that it might be more straightforward to avoid depending on anything in particular being present in the FunctionLoweringInfo .

Any thoughts on what this "token" should be? i.e. a new instruction, call to a new intrinsic, one entry phi with metadata, <other>?

Intrinsic is hopefully sufficient here. (One entry PHI is very fragile, new instruction is a lot of work to implement.)

I'm not certain either; it looks like SelectionDAGISel::SelectAllBasicBlocks does a ReversePostOrderTraversal of the basic blocks of a function.

The point of this digression is just that it might be more straightforward to avoid depending on anything in particular being present in the FunctionLoweringInfo .

So one idea/prototype I did have relied on basically reparsing the constraint string of the inline asm via TargetLowering::ParseConstraints, as is done for SelectionDAGBuilder::visitInlineAsm (but just for the outputs). I guess it's a fair point that we might not yet have an entry for the callbr value in the FunctionLoweringInfo yet. So perhaps I will have to parse the constraints in such a case, then use RegsForValue to define the initial virtual register. I believe then SelectionDAG will find+reuse that virtreg for the callbr when it is eventually encountered. I believe this is how SelectionDAGBuilder can define virtregs for values when the use is visited before the def.

Any thoughts on what this "token" should be? i.e. a new instruction, call to a new intrinsic, one entry phi with metadata, <other>?

Intrinsic is hopefully sufficient here. (One entry PHI is very fragile, new instruction is a lot of work to implement.)

SGTM. Many intrinsics are categorized as "Code Generator Intrinsics" (https://llvm.org/docs/LangRef.html#code-generator-intrinsics) so they seem an appropriate tool here. Hopefully they don't live long enough to need any of the isEHPad pinning as is done in this commit.

(If folks have other feedback/ideas/compliments/insults, I'd be happy to hear them before embarking on the above plan; I'll summarize it in the RFC https://discourse.llvm.org/t/rfc-syncing-asm-goto-with-outputs-with-gcc/65453/5 too).

nickdesaulniers abandoned this revision.Dec 15 2022, 5:21 PM