In the LLVM test suite, very small functions like this are fairly common:
declare i8* @bar(i32*, i32, i32**)

define i8* @foo(i32 %x, i32** %y) {
  %z = tail call i8* @bar(i32* null, i32 %x, i32** %y)
  ret i8* %z
}
In GlobalISel, during CallLowering, we tend to produce copies in the following order (for AArch64):
%0:_(s32) = COPY $w0
%1:_(p0) = COPY $x1
%3:_(p0) = G_CONSTANT i64 0
$x0 = COPY %3(p0)
$w1 = COPY %0(s32)
$x2 = COPY %1(p0)
TCRETURNdi @bar, 0, csr_darwin_aarch64_aapcs, implicit $sp, implicit $x0, implicit $w1, implicit $x2
SelectionDAG, on the other hand, produces this:
%1:gpr64 = COPY $x1
%0:gpr32 = COPY $w0
%2:gpr64all = COPY $xzr
$x0 = COPY %2
$w1 = COPY %0
$x2 = COPY %1
TCRETURNdi @bar, 0, csr_darwin_aarch64_aapcs, implicit $sp, implicit $x0, implicit $w1, implicit $x2
All that's different here is the order of the copies from physical registers. However, GlobalISel ultimately produces twice as many copies: https://godbolt.org/z/GKGsh5
This is because at some point during greedy regalloc, we (greedily) make the wrong choice of virtual register mapping, and miss the chance to emit identity copies.
It seems like the register allocator is making a reasonable choice with what it's been given. So, personally, I think it makes sense to canonicalize these copies into a form which is amenable to the greedy register allocator.
This patch adds a specialized copy localizer pass. This pass is separate from the main localizer because I think the circumstances where it should be used are more specialized. Only targets (and opt-levels) which use the greedy register allocator can benefit from this.
This pass will reorder copies in a contiguous block of copies from physregs such that the first copy has the *latest* user in the MachineBasicBlock, and the last copy has the *earliest* user in the MachineBasicBlock.
So for this MIR:
%0:_(s32) = COPY $w0
%1:_(p0) = COPY $x1
%3:_(p0) = G_CONSTANT i64 0
$x0 = COPY %3(p0)
$w1 = COPY %0(s32)
$x2 = COPY %1(p0)
TCRETURNdi @bar, 0, csr_darwin_aarch64_aapcs, implicit $sp, implicit $x0, implicit $w1, implicit $x2
We'll reorder the copies like so:
%1:_(p0) = COPY $x1
%0:_(s32) = COPY $w0
%3:_(p0) = G_CONSTANT i64 0
$x0 = COPY %3(p0)
$w1 = COPY %0(s32)
$x2 = COPY %1(p0)
TCRETURNdi @bar, 0, csr_darwin_aarch64_aapcs, implicit $sp, implicit $x0, implicit $w1, implicit $x2
This:
- Minimizes the length of the live range for %0:_(s32) = COPY $w0.
- Ensures that its live range is fully contained within the live range of %1:_(p0) = COPY $x1 rather than partially overlapping with it.
This pass only:
- Shuffles the position of copies from physregs within one contiguous range of copies from physregs.
- Runs on MachineFunctions with a single MachineBasicBlock. (More MBBs -> more likely that there's enough stuff going on that this won't make a difference.)
- Runs on very small MachineFunctions. (This problem only seems to show up in very small functions where there isn't much going on other than copies.)
In the example above, the pass will swap the positions of %0 and %1 because %1's user ($x2 = COPY %1) appears later in the block than %0's user ($w1 = COPY %0).
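Roughly, the reordering step would look something like the sketch below. (This is simplified; the helper name, the block numbering, and the splicing approach are illustrative, not necessarily exactly what the patch does.)

#include "llvm/ADT/DenseMap.h"
#include "llvm/ADT/STLExtras.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/CodeGen/MachineBasicBlock.h"
#include "llvm/CodeGen/MachineFunction.h"
#include "llvm/CodeGen/MachineRegisterInfo.h"
#include "llvm/CodeGen/Register.h"
#include <algorithm>
#include <utility>

using namespace llvm;

// Reorder a contiguous run [Begin, End) of copies from physregs so that the
// copy whose last user appears latest in the block comes first, and the copy
// whose last user appears earliest comes last. The pass would only call this
// on small MachineFunctions with a single MachineBasicBlock.
static void reorderPhysRegCopies(MachineBasicBlock &MBB,
                                 MachineBasicBlock::iterator Begin,
                                 MachineBasicBlock::iterator End) {
  MachineRegisterInfo &MRI = MBB.getParent()->getRegInfo();

  // Number every instruction in the block so user positions can be compared.
  DenseMap<const MachineInstr *, unsigned> Order;
  unsigned Idx = 0;
  for (const MachineInstr &MI : MBB)
    Order[&MI] = Idx++;

  // Pair each copy with the position of its latest (non-debug) user.
  SmallVector<std::pair<unsigned, MachineInstr *>, 8> Copies;
  for (auto It = Begin; It != End; ++It) {
    Register Dst = It->getOperand(0).getReg();
    unsigned LatestUse = 0;
    for (MachineInstr &UseMI : MRI.use_nodbg_instructions(Dst))
      LatestUse = std::max(LatestUse, Order.lookup(&UseMI));
    Copies.push_back({LatestUse, &*It});
  }

  // Copies with later users should be emitted earlier.
  llvm::stable_sort(Copies, [](const auto &A, const auto &B) {
    return A.first > B.first;
  });

  // Splice the copies back in, in the new order, just before the end of the
  // original run.
  for (auto &P : Copies)
    MBB.splice(End, &MBB, P.second->getIterator());
}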
I think there are a lot of places where this could live (CallLowering, the IRTranslator, the existing Localizer, ...). I'm not particularly attached to any of these options, because I think the algorithm would come out about the same either way. Personally, I think this makes sense as a standalone pass because the circumstances in which it should run and the scope of what it does are distinct enough that it feels like its own thing.
(FWIW, if it's possible and makes sense to teach the register allocator to handle this situation, I'd prefer that solution. However, my gut feeling says that it's probably better to just canonicalize the copies using a pass.)
For now, only add this pass to AArch64.
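Wiring-wise, that would look something along these lines (a sketch only; CopyLocalizer is a placeholder name, and the exact hook and gating may differ):

// In llvm/lib/Target/AArch64/AArch64TargetMachine.cpp.
// CopyLocalizer is a placeholder name for the new pass.
void AArch64PassConfig::addPreGlobalInstructionSelect() {
  // ... existing GlobalISel passes for AArch64 ...

  // Only worthwhile when the greedy register allocator will run, i.e. at
  // optimization levels above -O0.
  if (getOptLevel() != CodeGenOpt::None)
    addPass(new CopyLocalizer());
}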
I don't understand what you mean here.
The physreg interferes no matter how you arrange the live-ranges.
I don't get the partial part either. (Partial overlapping comes from sub registers, and this example doesn't have any :)).