Page MenuHomePhabricator

[AArch64] Match the windows canonical callee saved register order

Authored by mstorsjo on Wed, Sep 30, 2:02 AM.



On windows, the callee saved registers in a canonical prologue are ordered starting from a lower register number at a lower stack address (with the possible gap for aligning the stack at the top); this is the opposite order that llvm normally produces.

To achieve this, reverse the order of the registers in the assignCalleeSavedSpillSlots callback, and adjust the code for matching up register pairs (to expect them in reverse order). In order to pair them properly when iterating in reverse order, restrict pairing pairs starting with an odd numbered GPR (e.g. for [x19, x20, x21], iterated over backwards, don't pair up x21 with x20, but leave x21 alone and pair x20 with x19).

This allows generated prologs more often to match the format that allows the unwind info to be written as packed info.

I've got an alternative implementation of the same, that keeps the register iteration order normal, but changes computeCalleeSaveRegisterPairs to lay things from the bottom up, instead of top down (contrary to the generic code in PrologEpilogInserter, that still lays out the matching stack objects top down). This results in a slightly smaller code change, but has the effect that the register names for the CSR stack objects (visible in MIR) actually mismatch where the registers really are saved.

I've also got two more patches coming up on top of this one, but I'm holding off the other ones until this one is settled, as one of the later ones depend on the exact form of this one.

With all of them applied, a 228 KB xdata section shrinks by 74 KB thanks to being able to write packed unwind info, ending up with smaller xdata than the corresponding section for an x86_64 build of the same DLL.

Diff Detail

Event Timeline

mstorsjo created this revision.Wed, Sep 30, 2:02 AM
mstorsjo requested review of this revision.Wed, Sep 30, 2:02 AM

With all of them applied, a 228 KB xdata section shrinks by 74 KB thanks to being able to write packed unwind info, ending up with smaller xdata than the corresponding section for an x86_64 build of the same DLL.

Actually, I just found another hopefully pretty simple fix that gets rid of another 46 KB - so the gains in the end are pretty notable!

efriedma added inline comments.Wed, Sep 30, 1:48 PM

This if ends up being a little confusing to read... maybe worth duplicating the if (Reg2 == Reg1 + 1) check into both the integer and fp branches of the if statement.


Do we really need to check NeedsWinCFI here? It seems like the rounding should be a property of the chosen stack layout, which should already be determined at this point.

mstorsjo added inline comments.Wed, Sep 30, 2:10 PM

Hmm, maybe...

I'm a bit dissatisfied with this solution, as it misses some pairing opportunities: E.g. if saving only x20 and x21, but not x19, this code wouldn't manage to pair x20 and x21 - as when iterating over them in the top-to-bottom order, when inspecting x21, it'll choose not to pair it with x20, as it expects x20 to be paired with x19.

But I'm a bit out of ideas of a good way of achieving that, without making a huge mess in the generic code in computeCalleeSaveRegisterPairs...

On the other hand, I'm not sure if that's much of a practical issue that happens often, other than in pathological cases?


We do: If we need to save the set of x19, d8 and d9, the windows stack layout of that is (bottom-up) x19, (d8, d9), gap. When iterating backwards starting from the top, we start out with d9 and pair it with d8 - but we need to add the alignment (to the d9 stack object) to get the gap here at the top. But the pair d8,d9 aren't aligned to a 16 byte boundary themselves.

efriedma added inline comments.Wed, Sep 30, 2:36 PM

Probably rare; the standard register allocation order will use x19 before x20, so ignoring weird edge cases, the only way to get into this situation is if something explicitly clobbers x20. Which probably means inline asm, since there isn't really any other way to trigger that reliably.

On the other hand, I think computeCalleeSaveRegisterPairs would be easier to follow if we actually iterated over the registers in the right order. The other changes are basically working around the fact that we're iterating in the wrong order. So maybe worth doing even if we don't really expect any significant optimization benefit. Hopefully it's not that complicated?

mstorsjo added inline comments.Thu, Oct 1, 2:08 AM

That was actually my original approach, and I have a separate version of this patch that does exactly that.

The reason why I preferred this one, is that it keeps the association between stack objects (as allocated and laid out by PrologEpilogInserter's calculateFrameObjectOffsets) more straightforward.

But I can put that alternative version up for review and point out the (potentially?) problematic mismatch there.

mstorsjo abandoned this revision.Thu, Oct 1, 11:36 PM

Superseded by D88677.