This is an archive of the discontinued LLVM Phabricator instance.

[ARM64EC 11/?] Add support for lowering variadic indirect calls.
Needs ReviewPublic

Authored by bcl5980 on Jul 13 2022, 8:38 PM.

Details

Summary

Part of initial Arm64EC patchset.
Variadic function's exit thunk is different from normal function. It need to allocate a dynamic stack allocation on the bottom of stack. Then copy the original arguments on the stack to the new allocation.
When we have a varargs function that returns the value in a register on AArch64, but requires an “sret” return on x64, we need to shuffle around the argument registers, and store x3 to the stack with 8bytes alignment.

Diff Detail

Event Timeline

bcl5980 created this revision.Jul 13 2022, 8:38 PM
Herald added a project: Restricted Project. · View Herald TranscriptJul 13 2022, 8:38 PM
bcl5980 requested review of this revision.Jul 13 2022, 8:38 PM
Herald added a project: Restricted Project. · View Herald TranscriptJul 13 2022, 8:38 PM
bcl5980 added inline comments.Jul 13 2022, 9:26 PM
llvm/test/CodeGen/AArch64/arm64ec-cfg.ll
363–365

It looks Microsoft generate code:

sub         sp,sp,x15,lsl #4

Does anyone know why we can't accept SP as input/ouput for the instruction?

def : Pat<(sub GPR64:$Rn, arith_shifted_reg64:$Rm),
          (SUBSXrs GPR64:$Rn, arith_shifted_reg64:$Rm)>;
bcl5980 updated this revision to Diff 444546.Jul 14 2022, 12:56 AM
bcl5980 updated this revision to Diff 444548.Jul 14 2022, 12:59 AM

fix build error

efriedma added inline comments.Jul 14 2022, 10:04 AM
llvm/lib/Target/AArch64/AArch64FrameLowering.cpp
483 ↗(On Diff #444548)

For the whole "x3 is stored to the stack" thing, if we're going to continue to use an alloca, I'd probably recommend just increasing the size of the alloca by 8 bytes, then explicitly emit a store instruction in the IR. Messing with the alignment like this seems likely to cause issues.

llvm/test/CodeGen/AArch64/arm64ec-cfg.ll
363–365

The instruction used by the Microsoft compiler is SUBXrx64. SUBSXrs can't refer to sp that way; the "lsl" is actually an alternate spelling of "uxtx".

We could add a specialized pattern specifically for subtraction operations where the first operand of the subtraction is a copy from sp.

bcl5980 added inline comments.Jul 14 2022, 11:47 PM
llvm/lib/Target/AArch64/AArch64FrameLowering.cpp
483 ↗(On Diff #444548)

I'm trying to move these code into AArch64TargetLowering::LowerCall but still have some problems about the stack layout. And I also trying to make the IR version also correct. Thanks for the idead of allocate extra 8bytes.

llvm/test/CodeGen/AArch64/arm64ec-cfg.ll
363–365

We could add a specialized pattern specifically for subtraction operations where the first operand of the subtraction is a copy from sp.
Do you mean add pattern in td file?

def : Pat<(sub GPR64sp:$SP, arith_extended_reg32to64_i64:$Rm),
          (SUBSXrx GPR64sp:$SP, arith_extended_reg32to64_i64:$Rm)>;

If yes, we may need to add a new select function similar to arith_extended_reg32to64_i64 but remove check extend. Or can we do this on DAGCombine?

And I think we can do this on another patch.

bcl5980 updated this revision to Diff 444899.Jul 15 2022, 12:29 AM

update based on efriedma's suggestion.

bcl5980 marked an inline comment as not done.Jul 15 2022, 1:25 AM
bcl5980 added inline comments.
llvm/test/CodeGen/AArch64/arm64ec-cfg.ll
489–490

Can be TODO also.
sub x0, x29, #48 rematerialize from copy x10
and fmov d0, x10 can't rematerialize so sub x10, x29, #48 remain.
How could we improve the reMaterializeTrivialDef to improve the code?

bcl5980 marked an inline comment as not done.Jul 15 2022, 1:26 AM
efriedma added inline comments.Jul 15 2022, 11:26 AM
llvm/lib/Target/AArch64/AArch64Arm64ECCallLowering.cpp
182

Maybe make the contribution from RetStack a bit more clear.

llvm/test/CodeGen/AArch64/arm64ec-cfg.ll
363–365

I think you'd want a PatLeaf to specifically look for a CopyFromReg from sp, as opposed to changing the way we lower all subtraction operations, but yes, that's the idea.

(You could, alternatively, introduce a target-specific node in AArch64ISelLowering.h specifically to represent "sub sp, sp, xN, lsl #4", use it from LowerDYNAMIC_STACKALLOC, and write a pattern to lower it to SUBXrx64.)

And yes, please leave it for another patch.

489–490

Not sure how we're ending up with two separate operations in the first place; I'd normally expect SelectionDAG CSE to kick in.

bcl5980 added inline comments.Jul 16 2022, 5:41 AM
llvm/test/CodeGen/AArch64/arm64ec-cfg.ll
489–490

It happen in the pass RegisterCoalescer, after the DAG CSE, even after machine CSE.
The machine IR is:

%15:gpr64sp = ADDXri %stack.0, 0, 0
$x0 = COPY %15:gpr64sp
$d0 = COPY %15:gpr64sp

x0 can rematerialize to ADDXri %stack.0, 0, 0, but d0 can not.

bcl5980 updated this revision to Diff 445235.Jul 16 2022, 8:05 AM
bcl5980 updated this revision to Diff 445236.Jul 16 2022, 8:11 AM

When I trying to move the memcpy to SelectionDag AArch64TargetLowering::LowerCall, there are two issues I can't fix:

  1. Load address for __os_arm64x_dispatch_call_no_redirect is after memory copy in IR version. But if we copy memory in LowerCall the load is before memory copy. It cause one more register usage.
  2. 32 bytes for register store should be the real bottom on the stack but when I move memory copy into LowerCall , the dynamic allocation is always the real bottom on the stack.

@efriedma Do you have any idea to fix them?

Load address for __os_arm64x_dispatch_call_no_redirect is after memory copy in IR version. But if we copy memory in LowerCall the load is before memory copy. It cause one more register usage.

That's a consequence of doing the load in IR, I think. The load is before the memcpy in the initial SelectionDAG, and nothing tries to rearrange them. By default, scheduling happens in source order, and we don't reorder across calls after isel. I don't see any obvious fix; maybe the call lowering code could try to find the load and mess with its chains? But I wouldn't worry about it; if we actually care about the performance of varargs thunks, there are probably more significant improvements we could make, like trying to inline small memcpys.

32 bytes for register store should be the real bottom on the stack but when I move memory copy into LowerCall , the dynamic allocation is always the real bottom on the stack.

Maybe AArch64FrameLowering::hasReservedCallFrame is returning the wrong thing? Normally, the stack allocation for call arguments is allocated in the prologue; the stack frame layout code needs to know if that's illegal because there a dynamic/large/etc. allocation.

Alternatively, you could just make the extra 32 bytes part of the dynamic allocation, instead of trying to do it "properly".

bcl5980 updated this revision to Diff 446092.Jul 20 2022, 2:49 AM

SelectionDag version.

bcl5980 updated this revision to Diff 446094.Jul 20 2022, 2:53 AM

Diff 6 445236 is the Final IR version
Diff 8 446094 is SelectionDag version

bcl5980 updated this revision to Diff 446396.Jul 21 2022, 2:17 AM
bcl5980 updated this revision to Diff 452119.Aug 12 2022, 2:40 AM

fix struct return with larger than 16bytes crash

bcl5980 updated this revision to Diff 457503.Sep 1 2022, 10:28 PM
bcl5980 updated this revision to Diff 458083.Sep 5 2022, 7:58 PM

fix crash in non-opaque pointer mode

efriedma added inline comments.Sep 13 2022, 4:31 PM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
1438

Is this actually necessary? The linker should resolve plain "memcpy" to something reasonable, I think.

6596

In theory, you call CreateVariableSizedObject so that you can use the returned FrameIndex to refer to the object. If you don't actually use the FrameIndex for anything, the call looks sort of silly, sure.

bcl5980 added inline comments.Sep 26 2022, 2:57 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
1438

Based on my local test, if I don't add # the memcpy will be link into x86 version memcpy and crash at runtime.

efriedma added inline comments.Sep 26 2022, 12:34 PM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
1438

Does the linking process for arm64ec actually guarantee that we have an arm64ec msvcrt that includes "#memcpy" etc.? Even if it does, I don't think we can make the same assumption for all the other functions SelectionDAG needs to call.

Given that, we're going to need some mechanism to allow calls generated by SelectionDAG to participate in thunking.

In any case, if this change unblocks testing for you, we can leave it in with a FIXME to address the above.

bcl5980 added inline comments.Sep 26 2022, 7:19 PM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
1438

I try to use dumpbin to dump the symbols for vcruntime.lib. And I find those symbols may related:

40 #memchr
41 #memcmp
42 #memcpy
43 #memmove
44 #memset

I think we only need to consider the memory intrinsic functions that MSVC also can export.
We can also add these intrinsic's exit thunk in AArch64TargetLowering::LowerCall.

efriedma added inline comments.Sep 27 2022, 10:01 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
1438

The full set of stuff SelectionDAG can generate includes is basically stuff from RuntimeLibcalls.def, plus a few target-specific bits. (Off the top of my head, not sure if there are any target-specific calls on arm64 windows besides "__chkstk" and "__security_check_cookie".)

If we expect that arm64ec code normally links against an arm64ec C runtime, I guess most of the routines we'd want should be available in "#"-prefixed versions, but I'm not sure about all of them...

bcl5980 updated this revision to Diff 466268.Oct 8 2022, 1:14 AM

address comments

bcl5980 marked 5 inline comments as done.Oct 8 2022, 1:16 AM