This is an archive of the discontinued LLVM Phabricator instance.

[GlobalISel] Add a localizer for copies from physregs and use it in AArch64
Needs RevisionPublic

Authored by paquette on Sep 4 2020, 12:51 PM.

Details

Summary

In the LLVM test suite, very small functions like this are fairly common:

declare i8* @bar(i32*, i32, i32**)
define i8* @foo(i32 %x, i32** %y) {
  %z = tail call i8* @bar(i32* null, i32 %x, i32** %y)
  ret i8* %z
}

In GlobalISel, during CallLowering, we tend to produce copies in the following order (for AArch64):

%0:_(s32) = COPY $w0
%1:_(p0) = COPY $x1
%3:_(p0) = G_CONSTANT i64 0
$x0 = COPY %3(p0)
$w1 = COPY %0(s32)
$x2 = COPY %1(p0)
TCRETURNdi @bar, 0, csr_darwin_aarch64_aapcs, implicit $sp, implicit $x0, implicit $w1, implicit $x2

SelectionDAG, on the other hand, produces this:

%1:gpr64 = COPY $x1
%0:gpr32 = COPY $w0
%2:gpr64all = COPY $xzr
$x0 = COPY %2
$w1 = COPY %0
$x2 = COPY %1
TCRETURNdi @bar, 0, csr_darwin_aarch64_aapcs, implicit $sp, implicit $x0, implicit $w1, implicit $x2

All that's different here is the order of the copies from physical registers. However, GlobalISel ultimately produces twice as many copies: https://godbolt.org/z/GKGsh5

This is because at some point during greedy regalloc, we (greedily) make the wrong choice of virtual register mapping, and miss the chance to emit identity copies.

It seems like the register allocator is making a reasonable choice with what it's been given. So, personally, I think it makes sense to canonicalize these copies into a form which is amicable to the greedy register allocator.

This patch adds a specialized copy localizer pass. This pass is separate from the main localizer because I think the circumstances where it should be used are more specialized. Only targets (and opt-levels) which use the greedy register allocator can benefit from this.

This pass will reorder copies in a contiguous block of copies from physregs such that the first copy has the *latest* user in the MachineBasicBlock, and the last copy has the *earliest* user in the MachineBasicBlock.

So for this MIR:

%0:_(s32) = COPY $w0
%1:_(p0) = COPY $x1
%3:_(p0) = G_CONSTANT i64 0
$x0 = COPY %3(p0)
$w1 = COPY %0(s32)
$x2 = COPY %1(p0)
TCRETURNdi @bar, 0, csr_darwin_aarch64_aapcs, implicit $sp, implicit $x0, implicit $w1, implicit $x2

We'll reorder the copies like so:

%1:_(p0) = COPY $x1
%0:_(s32) = COPY $w0
%3:_(p0) = G_CONSTANT i64 0
$x0 = COPY %3(p0)
$w1 = COPY %0(s32)
$x2 = COPY %1(p0)
TCRETURNdi @bar, 0, csr_darwin_aarch64_aapcs, implicit $sp, implicit $x0, implicit $w1, implicit $x2

This

  1. Minimizes the length of the live range for %0:_(s32) = COPY $w0.
  2. Ensures that its live range is fully contained within the live range of %1:_(p0) = COPY $x1 rather than partially overlapping with it.

This pass only

  1. Shuffles the position of copies from physregs within one contiguous range of copies from physregs.
  2. Runs on MachineFunctions with a single MachineBasicBlock. (More MBBs -> more likely that there's enough stuff going on that this won't make a difference.)
  3. Runs on very small MachineFunctions. (This problem only seems to show up in very small functions where there isn't much going on other than copies.)

The pass will swap the positions of %0 and %1 because

I think there are a lot of places where this could live (CallLowering, the IRTranslator, the existing Localizer, ...). I'm not particularly attached to any of these options, because I think the algorithm would come out being about just the same. Personally, I think this makes sense as a standalone pass because the circumstances in which is should run, and the scope of what it does is distinct enough that it feels like its own thing.

(FWIW, if it's possible and makes sense to teach the register allocator to handle this situation, I'd prefer that solution. However, my gut feeling says that it's probably better to just canonicalize the copies using a pass.)

For now, only add this pass to AArch64.

Diff Detail

Event Timeline

paquette created this revision.Sep 4 2020, 12:51 PM
paquette requested review of this revision.Sep 4 2020, 12:51 PM

Hi Jessica,

This feels to me that we are adding another level of heuristics that could well degrade the performances on other cases. The fact that this runs only on functions with only one basic block and with less than 25 instructions tells me this is really a workaround against a more general problem.
Put differently, I don't see that this is generally useful to be worth the compile time.

Solving that properly IMHO would require to model batch of copies as parallel copies (all the copies happen in parallel in a bundle) and we would serialize them only after register allocation. We already do something like that in SplitKit IIRC.

I am guessing your concern is that we regress some cases compared to SDISel (I would also expect that we improve some as well!), thus I wondering if we should try to "fix" this at all.
If we need to fix this, could we instead issue the arguments in the same order as SDISel in the meantime? (While we bring up the proper support for parallel copies.)

Cheers,
-Quentin

llvm/include/llvm/CodeGen/GlobalISel/CopyLocalizer.h
31

I don't understand what you mean here.
The physreg interferes no matter how you arrange the live-ranges.

I don't get the partial part either. (Partial overlapping comes from sub registers, and this example doesn't have any :)).

45

Before we had (read '->' == interferes with):
%x1 -> py, pz, x2, x3
%x2 -> pz, pa, x1, x3
%x3 -> pa, pb, x1, x2

After we have:
%x1 -> x2, x3
%x2 -> pz, px, pa, x1, x3
%x3 -> px, pa, pb, x1, x2

What is the benefit of doing so?
We basically relaxed the constraint on x1, because increased it on x2 and x3.

I remember you mention that the change to CallLowering approach to this issue resulted in worse perf or code size. That does seem odd to me since we should be just aping SDAG, so even if we had some regressions we shouldn't be worse overall.

Hi Jessica,

This feels to me that we are adding another level of heuristics that could well degrade the performances on other cases. The fact that this runs only on functions with only one basic block and with less than 25 instructions tells me this is really a workaround against a more general problem.
Put differently, I don't see that this is generally useful to be worth the compile time.

I don't disagree; I'm not particularly attached to or happy with this approach. :)

Solving that properly IMHO would require to model batch of copies as parallel copies (all the copies happen in parallel in a bundle) and we would serialize them only after register allocation. We already do something like that in SplitKit IIRC.

I think I prefer this.

I am guessing your concern is that we regress some cases compared to SDISel (I would also expect that we improve some as well!), thus I wondering if we should try to "fix" this at all.

I found a surprising number of places where this happens. It's extremely noticeable in small functions containing tail calls.

These are some of the cases I found:

https://godbolt.org/z/xakndt
https://godbolt.org/z/pnK5xQ
https://godbolt.org/z/ziYBUs
https://godbolt.org/z/fwmRvE
https://godbolt.org/z/w5uPHH
https://godbolt.org/z/xJVMgp
https://godbolt.org/z/ZDXn-_
https://godbolt.org/z/fcDWGT
https://godbolt.org/z/-trkgD
https://godbolt.org/z/faLS0z
https://godbolt.org/z/xUdnUq

I suspect there are more cases like this. These are just the ones I found by manually clicking through code size regressions I found with a script.

If we need to fix this, could we instead issue the arguments in the same order as SDISel in the meantime? (While we bring up the proper support for parallel copies.)

I tried doing that, and it made some cases better, and some cases worse. (Like you mentioned, we improve some things!) It didn't seem like there was a strong improvement or regression, so I don't think it's worth doing.

If there's a proper way to model what we're missing here (e.g. parallel copies), I think it would be best to just wait for that.

arsenm requested changes to this revision.Aug 18 2023, 6:39 AM

I think we need some kind of scheduler / physreg copy scheduler, not necessarily a copy cloning approach

This revision now requires changes to proceed.Aug 18 2023, 6:39 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 18 2023, 6:39 AM