Part of initial Arm64EC patchset.
Variadic function's exit thunk is different from normal function. It need to allocate a dynamic stack allocation on the bottom of stack. Then copy the original arguments on the stack to the new allocation.
When we have a varargs function that returns the value in a register on AArch64, but requires an “sret” return on x64, we need to shuffle around the argument registers, and store x3 to the stack with 8bytes alignment.
Details
- Reviewers
efriedma dpaoliello DavidSpickett
Diff Detail
Event Timeline
llvm/test/CodeGen/AArch64/arm64ec-cfg.ll | ||
---|---|---|
363–365 | It looks Microsoft generate code: sub sp,sp,x15,lsl #4 Does anyone know why we can't accept SP as input/ouput for the instruction? def : Pat<(sub GPR64:$Rn, arith_shifted_reg64:$Rm), (SUBSXrs GPR64:$Rn, arith_shifted_reg64:$Rm)>; |
llvm/lib/Target/AArch64/AArch64FrameLowering.cpp | ||
---|---|---|
483 ↗ | (On Diff #444548) | For the whole "x3 is stored to the stack" thing, if we're going to continue to use an alloca, I'd probably recommend just increasing the size of the alloca by 8 bytes, then explicitly emit a store instruction in the IR. Messing with the alignment like this seems likely to cause issues. |
llvm/test/CodeGen/AArch64/arm64ec-cfg.ll | ||
363–365 | The instruction used by the Microsoft compiler is SUBXrx64. SUBSXrs can't refer to sp that way; the "lsl" is actually an alternate spelling of "uxtx". We could add a specialized pattern specifically for subtraction operations where the first operand of the subtraction is a copy from sp. |
llvm/lib/Target/AArch64/AArch64FrameLowering.cpp | ||
---|---|---|
483 ↗ | (On Diff #444548) | I'm trying to move these code into AArch64TargetLowering::LowerCall but still have some problems about the stack layout. And I also trying to make the IR version also correct. Thanks for the idead of allocate extra 8bytes. |
llvm/test/CodeGen/AArch64/arm64ec-cfg.ll | ||
363–365 | We could add a specialized pattern specifically for subtraction operations where the first operand of the subtraction is a copy from sp. def : Pat<(sub GPR64sp:$SP, arith_extended_reg32to64_i64:$Rm), (SUBSXrx GPR64sp:$SP, arith_extended_reg32to64_i64:$Rm)>; If yes, we may need to add a new select function similar to arith_extended_reg32to64_i64 but remove check extend. Or can we do this on DAGCombine? And I think we can do this on another patch. |
llvm/test/CodeGen/AArch64/arm64ec-cfg.ll | ||
---|---|---|
489–490 | Can be TODO also. |
llvm/lib/Target/AArch64/AArch64Arm64ECCallLowering.cpp | ||
---|---|---|
182 | Maybe make the contribution from RetStack a bit more clear. | |
llvm/test/CodeGen/AArch64/arm64ec-cfg.ll | ||
363–365 | I think you'd want a PatLeaf to specifically look for a CopyFromReg from sp, as opposed to changing the way we lower all subtraction operations, but yes, that's the idea. (You could, alternatively, introduce a target-specific node in AArch64ISelLowering.h specifically to represent "sub sp, sp, xN, lsl #4", use it from LowerDYNAMIC_STACKALLOC, and write a pattern to lower it to SUBXrx64.) And yes, please leave it for another patch. | |
489–490 | Not sure how we're ending up with two separate operations in the first place; I'd normally expect SelectionDAG CSE to kick in. |
llvm/test/CodeGen/AArch64/arm64ec-cfg.ll | ||
---|---|---|
489–490 | It happen in the pass RegisterCoalescer, after the DAG CSE, even after machine CSE. %15:gpr64sp = ADDXri %stack.0, 0, 0 $x0 = COPY %15:gpr64sp $d0 = COPY %15:gpr64sp x0 can rematerialize to ADDXri %stack.0, 0, 0, but d0 can not. |
When I trying to move the memcpy to SelectionDag AArch64TargetLowering::LowerCall, there are two issues I can't fix:
- Load address for __os_arm64x_dispatch_call_no_redirect is after memory copy in IR version. But if we copy memory in LowerCall the load is before memory copy. It cause one more register usage.
- 32 bytes for register store should be the real bottom on the stack but when I move memory copy into LowerCall , the dynamic allocation is always the real bottom on the stack.
@efriedma Do you have any idea to fix them?
Load address for __os_arm64x_dispatch_call_no_redirect is after memory copy in IR version. But if we copy memory in LowerCall the load is before memory copy. It cause one more register usage.
That's a consequence of doing the load in IR, I think. The load is before the memcpy in the initial SelectionDAG, and nothing tries to rearrange them. By default, scheduling happens in source order, and we don't reorder across calls after isel. I don't see any obvious fix; maybe the call lowering code could try to find the load and mess with its chains? But I wouldn't worry about it; if we actually care about the performance of varargs thunks, there are probably more significant improvements we could make, like trying to inline small memcpys.
32 bytes for register store should be the real bottom on the stack but when I move memory copy into LowerCall , the dynamic allocation is always the real bottom on the stack.
Maybe AArch64FrameLowering::hasReservedCallFrame is returning the wrong thing? Normally, the stack allocation for call arguments is allocated in the prologue; the stack frame layout code needs to know if that's illegal because there a dynamic/large/etc. allocation.
Alternatively, you could just make the extra 32 bytes part of the dynamic allocation, instead of trying to do it "properly".
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp | ||
---|---|---|
1438 | Is this actually necessary? The linker should resolve plain "memcpy" to something reasonable, I think. | |
6596 | In theory, you call CreateVariableSizedObject so that you can use the returned FrameIndex to refer to the object. If you don't actually use the FrameIndex for anything, the call looks sort of silly, sure. |
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp | ||
---|---|---|
1438 | Based on my local test, if I don't add # the memcpy will be link into x86 version memcpy and crash at runtime. |
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp | ||
---|---|---|
1438 | Does the linking process for arm64ec actually guarantee that we have an arm64ec msvcrt that includes "#memcpy" etc.? Even if it does, I don't think we can make the same assumption for all the other functions SelectionDAG needs to call. Given that, we're going to need some mechanism to allow calls generated by SelectionDAG to participate in thunking. In any case, if this change unblocks testing for you, we can leave it in with a FIXME to address the above. |
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp | ||
---|---|---|
1438 | I try to use dumpbin to dump the symbols for vcruntime.lib. And I find those symbols may related: 40 #memchr 41 #memcmp 42 #memcpy 43 #memmove 44 #memset I think we only need to consider the memory intrinsic functions that MSVC also can export. |
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp | ||
---|---|---|
1438 | The full set of stuff SelectionDAG can generate includes is basically stuff from RuntimeLibcalls.def, plus a few target-specific bits. (Off the top of my head, not sure if there are any target-specific calls on arm64 windows besides "__chkstk" and "__security_check_cookie".) If we expect that arm64ec code normally links against an arm64ec C runtime, I guess most of the routines we'd want should be available in "#"-prefixed versions, but I'm not sure about all of them... |
Maybe make the contribution from RetStack a bit more clear.