Unfortunately, in practice, this isn't ever matched by the code generated by LLVM, as the frame layout doesn't match the canonical layout required by the packed format.
Originally I didn't, I was just trying to fulfill my curiosity regarding this (that has ended up taking way much longer than I originally anticipated), but we'll see if I end up nerd-sniping myself into it in the end.
I'm also toying with a relaxed version of this patch, that just checks whether the layout of stored registers on the stack is right, even though the exact sequence of operations isn't the expected one. E.g. in one case, llvm generates
sub sp, sp, #32 stp x19, x20, [sp, #16]
which generates the same stack layout as
stp x19, x20, [sp, #-16]! sub sp, sp, #16
which is possible to describe with packed unwind info. As long as unwinding doesn't happen within prologues/epilogues, it should be equal. Not sure if one wants to do that for real (the length of the synthesized canonical prologue might differ a little from the actual one maybe), but it's at least interesting to play with, to see how big gains there potentially could be by tweaking the frame lowering.
I'm not entirely sure how to interpret that clause.
I'm looking at one case, where MSVC generates the following prologue:
0: fd 7b bb a9 stp x29, x30, [sp, #-80]! 4: fd 03 00 91 mov x29, sp 8: ff 43 00 d1 sub sp, sp, #16
However, the final sub sp, sp, #16 isn't part of the canonical prologue, and this is described as packed unwind info RegI=0, CR=3, FrameSize=80, which llvm-readobj prints as
Prologue [ mov x29, sp stp x29, lr, [sp, #-80]! end ]
However the actual epilogue of the generated code looks like this:
dc: ff 43 00 91 add sp, sp, #16 e0: fd 7b c5 a8 ldp x29, x30, [sp], #80
(Here, instead of mov sp, x29 it has add sp, sp #16, but it doesn't matter as they're equivalent here.)
So if unwinding from the body, the prologue is executed, and it starts off by reversing mov x29, sp from the prologue. But if choosing to execute the epilogue instead of the prologue, it has to execute that instruction, to get the right stack pointer.
From testing the unwinder behaviour with RtlVirtualUnwind, it does look like it does execute mov sp, x29 as part of the epilogue for these cases though, which would make sense.
So I guess it should be ok to generate packed data if the epilogue is an exact match of the prologue, except for the final set_fp opcode - but that wording actually makes me worry more about the cases where the epilogue does contain that opcode (and rely on it). And the docs say
But I'd say that the code generated by MSVC that I quoted does seem to rely on it.
After updating the other patch, where a "stp x19, x20" can't be made into SaveR19R20X if the offset is too large, I tried update the code here to expect SaveRegPX for x19 - but it turns out it's impossible to make such a case with a large enough offset not to be canonicalized into SaveR19R20X, while maintaining the expected stack layout here, so I just updated the comment here.
The false parameter (to the parameter TryPacked) here was the reason why this didn't match anything for real code generation. With D87448 in place, we don't call this needlessly.
Alternatively, we could remove the TryPacked parameter, and just look at info->HandlesExceptions (as we already do). In that case, a plain .seh_handlerdata wouldn't skip making packed info, but that feels a bit risky (as .seh_handlerdata implies switching to the xdata section, supposedly after emitting the preceding unwind info there). But then again, if there's no unwind handler set, nothing would be able to find the trailing data written to the xdata section anyway? So maybe we could just remove this extra check as well?
I don't see how that would cause miscompiles here. But in any case, that experiment turned out to be less exciting than expected in the end; this patch, when properly working without being blocked by extra .seh_handlerdata, shaves off 2.5 KB from a xdata section of 228 KB, and such a relaxed matcher only got rid of another 2 KB. So for it to get any notable effect, the frame lowering needs to be adjusted anyway. But at least it has *some* effect now.
I'll see if I can get @TomTan to comment on it.
I discussed this with @TomTan, and he clarified that the synthesized epilogue for these cases lacks the corresponding set_fp opcode - because makes it suitable in a lot more cases for the kind of code MSVC often generates.
But he also agreed that the wording in the docs seems to be overly strict and the statement "Packed unwind data can't be used if a function requires restoration of sp from x29." doesn't seem to hold up (and agrees with the reasoning above that it would unwind correctly anyway), and he checked with the original document author who can't remember a reason why it wouldn't work either. So he agrees we can proceed with this patch, generating the packed format even if the epilogue is an exact mirror of the prologue, including the set_fp opcode.