This is an archive of the discontinued LLVM Phabricator instance.

[Arm64EC 6/?] Implement C/C++ mangling for Arm64EC function definitions.
Needs Review, Public

Authored by efriedma on May 11 2022, 1:52 PM.

Details

Summary

Part of initial Arm64EC patchset.

For the Arm64EC ABI, ARM64 functions have an alternate name. For C code, this name is just the original name prefixed with "#". For C++ code, we stick a "$$h" modifier in the middle of the mangling.

For functions which are not hybrid_patchable, the normal name is then an alias for the alternate name. (For functions that are patchable, we have to do something more complicated to tell the linker to generate a stub; I haven't tried to implement that yet.)

This doesn't emit quite the same symbol table as MSVC for simple cases: MSVC generates an IMAGE_WEAK_EXTERN_ANTI_DEPENDENCY alias, where this just makes another symbol pointing at the function definition. This probably matters for the hybmp$x table, but I don't have the complete documentation at the moment.

Event Timeline

efriedma created this revision.May 11 2022, 1:52 PM
Herald added a project: Restricted Project.May 11 2022, 1:52 PM
efriedma requested review of this revision.May 11 2022, 1:52 PM
bcl5980 added inline comments.
clang/lib/CodeGen/CodeGenModule.cpp
5300

One headache here.
To define the entry thunk, we need the function definition as it would be generated for the x64 triple, but the definition we have here is the aarch64 version.
For example, take the case in the Microsoft doc "Understanding Arm64EC ABI and assembly code":

struct SC {
    char a;
    char b;
    char c;
};
int fB(int a, double b, int i1, int i2, int i3);
int fC(int a, struct SC c, int i1, int i2, int i3);
int fA(int a, double b, struct SC c, int i1, int i2, int i3) {
    return fB(a, b, i1, i2, i3) + fC(a, c, i1, i2, i3);
}

x64 version IR for fA is:

define dso_local i32 @fA(i32 noundef %a, double noundef %b, ptr nocapture noundef readonly %c, i32 noundef %i1, i32 noundef %i2, i32 noundef %i3) local_unnamed_addr #0 { ... }

aarch64 version IR for fA is:

define dso_local i32 @"#fA"(i32 noundef %a, double noundef %b, i64 %c.coerce, i32 noundef %i1, i32 noundef %i2, i32 noundef %i3) #0 {...}

Arm64 allows a structure of any size up to 16 bytes to be passed directly in registers. x64 only allows sizes 1, 2, 4 and 8.
The entry thunk follows the x64 function type, but we only have the aarch64 function type.

I think the best approach is to create an x64 CodeGenModule and use it to generate the function type for the entry thunk. But that is hard for me to do here; I tried a little and ran into a lot of issues.

Another way is to modify only AArch64ABIInfo::classifyArgumentType: copy the x64 code into the function and add a flag that selects which version to use. That is easier, but I'm not sure this is the only difference between x64 and aarch64; the return-value classification may need the same treatment. I also don't think it is a clean approach.
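
The size mismatch described in this comment can be sketched as a small classifier. This is only an illustration: PassKind, classifyX64, classifyAArch64, and thunkMustRepack are hypothetical names, not clang APIs, and the model ignores floating-point aggregates (HFAs/HVAs) and everything other than plain by-value structs.

```cpp
#include <cassert>
#include <cstddef>

// How a by-value struct argument travels under a calling convention.
enum class PassKind { Direct, Indirect };

// Windows x64: only aggregates of exactly 1, 2, 4, or 8 bytes are passed
// in a register; every other size is passed by pointer to a copy.
PassKind classifyX64(std::size_t Bytes) {
  assert(Bytes > 0 && "empty aggregates are not modeled");
  bool FitsRegister = Bytes == 1 || Bytes == 2 || Bytes == 4 || Bytes == 8;
  return FitsRegister ? PassKind::Direct : PassKind::Indirect;
}

// AArch64 (simplified assumption): aggregates of up to 16 bytes are passed
// in general-purpose registers; larger ones are passed by pointer.
PassKind classifyAArch64(std::size_t Bytes) {
  assert(Bytes > 0 && "empty aggregates are not modeled");
  return Bytes <= 16 ? PassKind::Direct : PassKind::Indirect;
}

// True when the two conventions disagree, i.e. a thunk has to convert
// between a register value and a pointer to memory for this argument.
bool thunkMustRepack(std::size_t Bytes) {
  return classifyX64(Bytes) != classifyAArch64(Bytes);
}
```

Under these assumptions, thunkMustRepack is true for sizes 3, 5, 6, 7 and 9 through 16, and false for 1, 2, 4, 8 and anything above 16.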

efriedma added inline comments.Jul 19 2022, 10:31 AM
clang/lib/CodeGen/CodeGenModule.cpp
5300

Oh, that's annoying... I hadn't considered the case of a struct of size 3/5/6/7.

Like I noted on D126811, attaching thunks to calls is tricky if we try to do it from clang.

Computing the right IR type shouldn't be that hard by itself; we can call into call lowering code in TargetInfo without modifying much else. (We just need a bit to tell the TargetInfo to redirect the call, like D125419. Use an entry point like CodeGenTypes::arrangeCall.) You don't need to mess with the type system or anything like that.

The problem is correctly representing the lowered call in IR; we really don't want to do lowering early because it will block optimizations. I considered using an operand bundle; we can probably make that work, but it's complicated, and probably disables some optimizations.

I think the best thing we can do here is add an IR attribute to mark arguments which are passed directly on AArch64, but need to be passed indirectly for the x64 ABI. Then AArch64Arm64ECCallLowering can check for the attribute and modify its behavior. This isn't really clean in the sense that it's specific to the x64/aarch64 pair of calling conventions, but I think the alternative is worse.

bcl5980 added inline comments.Aug 9 2022, 9:52 PM
clang/lib/CodeGen/CodeGenModule.cpp
5300

It looks like it's not only 3/5/6/7: all sizes strictly larger than 8 and less than 16 also differ between the x86 ABI and the AArch64 ABI.
Maybe we can emit a function declaration here for the x86 ABI thunk, then define it in Arm64ECCallLowering.

efriedma added inline comments.Aug 10 2022, 1:07 PM
clang/lib/CodeGen/CodeGenModule.cpp
5300

I think the sizes between 8 and 16 work correctly already? All sizes greater than 8 are passed indirectly on x86, and the thunk generation code accounts for that. But that's not really important for the general question.

We need to preserve the required semantics for both the AArch64 and x86 calling conventions. There are basically the following possibilities:

  • We compute the declaration of the thunk in the frontend, and attach it to the call with an operand bundle. Like I mentioned, I don't want to go down this path: the operand bundle blocks optimizations, and it becomes more complicated for other code to generate arm64ec compatible calls.
  • We don't compute the definition of the thunk in the frontend. Given that, the only other way to attach the information we need to the call is to use attributes. The simplest thing is probably to attach the attribute directly to the argument; name it "arm64ec-thunk-pass-indirect", or something like that. (I mean, we could compute the whole signature and stuff it into a string attribute, but that doesn't really seem like an improvement...)
bcl5980 added inline comments.Aug 10 2022, 7:24 PM
clang/lib/CodeGen/CodeGenModule.cpp
5300

I think the sizes between 8 and 16 work correctly already? All sizes greater than 8 are passed indirectly on x86, and the thunk generation code accounts for that.

Yeah, the current exit thunk code already accounts for that. I mean we need to mark the parameter because the entry thunk behavior is also different.

Maybe we can compute the mangled name, like $iexit_thunk$cdecl$i8$m6 or $ientry_thunk$cdecl$m16$f, for the thunk function, and then set attributes like

"arm64ec-exitthunk"="$iexit_thunk$cdecl$i8$m6"
"arm64ec-entrythunk"="$ientry_thunk$cdecl$m16$f"

on the function.
I think we can reconstruct the whole thunk from the mangled name. This should be a little easier.

efriedma added inline comments.Aug 11 2022, 12:41 PM
clang/lib/CodeGen/CodeGenModule.cpp
5300

Each function has an arm64 function signature, and a corresponding x64 signature. The frontend always generates the function with the arm64 signature, and thunk generation translates that to the x64 signature. That part is the same whether we're generating an entry thunk, or an exit thunk. So I'm not sure why you're distinguishing between them in this context.

I'm not sure it makes sense to force the frontend to generate the mangled form, then make the backend demangle it. Seems more straightforward to just attach an attribute to an argument, and make the backend generate the mangled form?

bcl5980 added inline comments.Aug 11 2022, 8:00 PM
clang/lib/CodeGen/CodeGenModule.cpp
5300

Each function has an arm64 function signature, and a corresponding x64 signature. The frontend always generates the function with the arm64 signature, and thunk generation translates that to the x64 signature. That part is the same whether we're generating an entry thunk, or an exit thunk. So I'm not sure why you're distinguishing between them in this context.

I mean that which arguments need to be marked differs between the entry thunk and the exit thunk.
Both need arguments of size 3/5/6/7 to be marked.
But when the size is larger than 8 and less than 16, the entry thunk still needs the mark while the exit thunk doesn't.
So if we attach an attribute to an argument, we also need to consider sizes larger than 8 and less than 16.
When a function has a 15-byte argument, the frontend coerces it to i64x2. If we don't attach an attribute to it, the backend can't generate the correct entry thunk, because we have already lost the real size of the argument. The exit thunk doesn't need that because the code for 15 bytes and 16 bytes is the same: store the i64x2 to memory, then pass the address.

This is part of the 15-byte entry thunk:

	mov         fp,sp
	mov         x10,x1
	ldr         w1,[x10,#8]
	mov         x19,x0
	ldur        w8,[x10,#0xB]
	ldr         x0,[x10]
	bfi         x1,x8,#0x18,#0x20
	blr         x9

This is part of the 16-byte entry thunk:

	mov         fp,sp
	mov         x8,x1
	mov         x19,x0
	ldp         x0,x1,[x8]
	blr         x9
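
The loss of size information mentioned in this comment can be shown with a tiny model. It assumes, as the comment states, that an aggregate of 9 to 16 bytes is coerced to two i64 words; coercedWords and sizeLostAfterCoercion are hypothetical helper names used only for this sketch.

```cpp
#include <cassert>
#include <cstddef>

// Number of 64-bit words a small aggregate occupies after coercion
// (assumption: the size is rounded up to a multiple of 8 bytes).
std::size_t coercedWords(std::size_t Bytes) {
  assert(Bytes > 0 && Bytes <= 16 && "only small aggregates are modeled");
  return (Bytes + 7) / 8;
}

// True when two different source sizes become indistinguishable once only
// the coerced type is visible, which is the situation the backend is in.
bool sizeLostAfterCoercion(std::size_t A, std::size_t B) {
  return A != B && coercedWords(A) == coercedWords(B);
}
```

With this model a 15-byte and a 16-byte struct both become two words, so a backend that only sees the coerced type cannot tell how many bytes an entry thunk may safely read.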
efriedma added inline comments.Aug 12 2022, 12:09 PM
clang/lib/CodeGen/CodeGenModule.cpp
5300

Oh, I see what you mean. The size of the memory is the same either way according to the ABI, but we're "cheating" a bit with exit thunks: nothing cares if we allocate extra memory for exit thunks, so we can emit slightly shorter code. But for entry thunks, if we read past the end, we could cause a fault.

(Realistically, we could probably get away with reading 16 bytes for entry thunks, and not the "right" amount. In practice, indirect arguments always point to memory on the stack, and there's always going to be something on the stack after that, so reading past the end will never fault. But if MSVC is conservative here, maybe we should be too.)

(On a related side-note, if we do want to generate the conservative sequence, the code MSVC is generating here is sort of inefficient; something like "ldur x1, [x1, #7]; lsr x1, x1, #8" is going to be faster than ldr+ldr+bfi.)

I don't think that means we need the frontend to generate different markings depending on whether we're dealing with entry or exit thunks, though. The thunk generation code can handle the difference transparently if we don't make the frontend mangle the thunk signature.
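
The two ways of loading the upper half of a 15-byte argument compared in this comment can be checked against each other in portable code. This is a sketch under stated assumptions: it models the instructions with memcpy on a little-endian host, and the function names are hypothetical. Neither version touches memory beyond byte 14.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Models the MSVC sequence: ldr w1,[p,#8]; ldur w8,[p,#0xB]; bfi x1,x8,#24,#32.
// Reads bytes 8..11 and 11..14, then merges them into one 64-bit value.
std::uint64_t highHalfMsvcStyle(const unsigned char *P) {
  assert(P != nullptr);
  std::uint32_t W1, W8;
  std::memcpy(&W1, P + 8, 4);   // ldr  w1, [p, #8]
  std::memcpy(&W8, P + 11, 4);  // ldur w8, [p, #0xB]
  std::uint64_t X1 = W1;        // writing w1 zero-extends into x1
  const std::uint64_t Mask = 0xFFFFFFFFull << 24;
  // bfi: insert the low 32 bits of x8 into x1 starting at bit 24.
  return (X1 & ~Mask) | ((static_cast<std::uint64_t>(W8) << 24) & Mask);
}

// Models the suggested alternative: ldur x1,[p,#7]; lsr x1,x1,#8.
// Reads bytes 7..14 in one load and shifts the unwanted byte 7 out.
std::uint64_t highHalfShiftStyle(const unsigned char *P) {
  assert(P != nullptr);
  std::uint64_t X;
  std::memcpy(&X, P + 7, 8);
  return X >> 8;
}
```

Both functions return bytes 8 through 14 in the low 56 bits with the top byte zeroed, so for a valid 15-byte buffer they should agree.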

Another thing we need to consider here is this case:

#pragma pack(push, 1)
    struct b64 {
        char a[64];
    };
#pragma pack(pop)

    typedef b64 (fptrtype)(int a);

    b64 f(void* p, int a) {
        return ((fptrtype*)p)(a);
    }

For now we generate the exit thunk with type void f(void* sret(b64) ret, int a):

$iexit_thunk$cdecl$v$i8i8:              // @"$iexit_thunk$cdecl$v$i8i8"
.seh_proc $iexit_thunk$cdecl$v$i8i8
// %bb.0:
	sub	sp, sp, #48
	.seh_stackalloc	48
	stp	x29, x30, [sp, #32]             // 16-byte Folded Spill
	.seh_save_fplr	32
	add	x29, sp, #32
	.seh_add_fp	32
	.seh_endprologue
	mov	w1, w0
	mov	x0, x8
	adrp	x8, __os_arm64x_dispatch_call_no_redirect
	ldr	x8, [x8, :lo12:__os_arm64x_dispatch_call_no_redirect]
	blr	x8
	.seh_startepilogue
	ldp	x29, x30, [sp, #32]             // 16-byte Folded Reload
	.seh_save_fplr	32
	add	sp, sp, #48
	.seh_stackalloc	48
	.seh_endepilogue
	ret
	.seh_endfunclet
	.seh_endproc
                                        // -- End function
	.globl	f
	.def	f;
	.scl	2;
	.type	32;
	.endef

But it looks like Microsoft generates the exit thunk with type void* f(int a):

|$iexit_thunk$cdecl$i8$i8| PROC
|$LN2|
	pacibsp
	stp         fp,lr,[sp,#-0x10]!
	mov         fp,sp
	sub         sp,sp,#0x20
	adrp        x8,__os_arm64x_dispatch_call_no_redirect
	ldr         xip0,[x8,__os_arm64x_dispatch_call_no_redirect]
	blr         xip0
	mov         x0,x8
	add         sp,sp,#0x20
	ldp         fp,lr,[sp],#0x10
	autibsp
	ret

	ENDP  ; |$iexit_thunk$cdecl$i8$i8|

But clang for x86 on Windows also generates the function type void f(void* sret(b64) ret, int a).
It looks like clang differs from MSVC even in the x86 ABI.
Do we need to follow MSVC and generate $iexit_thunk$cdecl$i8$i8, or just follow clang's ABI and ignore the difference?

There's no way the calling convention can change based on whether you're calling a function vs. a function pointer. I can't explain why MSVC is generating different code. I think we should just ignore it, at least for now.

There's no way the calling convention can change based on whether you're calling a function vs. a function pointer. I can't explain why MSVC is generating different code. I think we should just ignore it, at least for now.

It's OK with me to ignore the difference, but I think the main issue is not function versus function pointer. It's how to generate the exit thunk when returning a structure larger than 16 bytes.
https://godbolt.org/z/MWv4YaKdK
There are three different ways to call an extern function, with three kinds of exit thunks. All of them keep the return value as it is, rather than moving the return value's pointer to the first argument.

The reason struct returns require register shuffling is that AArch64 passes the sret pointer in x8 (i.e. RAX), but the x64 calling convention expects it in RCX (i.e. x0).

Have you tried to see if the Microsoft-generated thunk actually works? I found at least one bug in MSVC thunk generation and reported it to Microsoft. (Microsoft didn't acknowledge the report, but that's a different story...)

The reason struct returns require register shuffling is that AArch64 passes the sret pointer in x8 (i.e. RAX), but the x64 calling convention expects it in RCX (i.e. x0).

So, for the function: s64 f(int a):
AArch64 CC: void f(x8, x0)
X64 CC: void f(rcx[x0], rdx[x1])
Going from AArch64 to X64, we need to add these instructions before the blr:

mov x1, x0
mov x0, x8

This matches $iexit_thunk$cdecl$m64$i8, which is what we get when calling an extern function rather than a function pointer.
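
The register shuffle described here can be modeled with a plain struct standing in for the three registers involved. This is only a sketch: ArgRegs and the function names are hypothetical, and the point is that the order of the two moves matters because x0 is both a source and a destination.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for the registers that matter for an sret call.
struct ArgRegs {
  std::uint64_t x0; // AArch64: first integer argument; x64 view: RCX
  std::uint64_t x1; // AArch64: second integer argument; x64 view: RDX
  std::uint64_t x8; // AArch64: indirect result (sret) pointer
};

// Correct order: save the integer argument first, then install the sret
// pointer as the first x64 argument.
ArgRegs shuffleSRetForX64(ArgRegs R) {
  R.x1 = R.x0; // mov x1, x0
  R.x0 = R.x8; // mov x0, x8
  return R;
}

// Swapped order, shown only to illustrate the hazard: x0 is overwritten
// before it is copied, so the integer argument is lost.
ArgRegs shuffleWrongOrder(ArgRegs R) {
  R.x0 = R.x8;
  R.x1 = R.x0;
  return R;
}

// Small self-check of both variants.
void checkShuffle() {
  ArgRegs In{42, 0, 0x1000};
  ArgRegs Out = shuffleSRetForX64(In);
  assert(Out.x0 == 0x1000 && Out.x1 == 42);
  ArgRegs Bad = shuffleWrongOrder(In);
  assert(Bad.x1 == 0x1000); // the value 42 has been clobbered
}
```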

Have you tried to see if the Microsoft-generated thunk actually works? I found at least one bug in MSVC thunk generation and reported it to Microsoft. (Microsoft didn't acknowledge the report, but that's a different story...)

You are right. So far I haven't tested many cases at runtime. But it looks like if a DLL import function is assigned to a function pointer and then called through it, the call causes an access violation.
Based on debugging, it should be an exit thunk issue: MSVC generates the wrong thunk type.

I think I'd like to continue moving forward with approximately this approach, at least for the moment. As far as I know, D132926 solves the remaining issues with translating the calling conventions. (I'll try to review D132926 soon.)

I think this looks reasonable to me, but I don't think I'm knowledgeable enough to give this a proper review, sorry.

A question about the mangling and alias part.
Should we also move the code that creates the alias to the backend? Sometimes we will emit the alias here but later the function will be inlined or eliminated by DCE.
And later we need to emit alias for direct call thunk also, like $originname$exitthunk. Put all of them into arm64eccalllowering pass should be better I think.

Sometimes we will emit the alias here but later the function will be inlined or eliminated by DCE.

If the alias is externally visible, it can't be eliminated; the compiler can't tell whether the symbol is referenced. If the alias isn't externally visible, it's dead from the outset. Not sure how this could become an issue.

And later we need to emit alias for direct call thunk also, like $originname$exitthunk.

Direct call thunks aren't directly relevant here; we only emit them for declarations, not definitions. I guess this does imply that we need to teach arm64eccalllowering how to modify mangled symbol names... and we could use that same code to insert the $$h.

Put all of them into arm64eccalllowering pass should be better I think.

I really don't want to do demangling in arm64eccalllowering. But looking at the generated patterns a bit more closely, maybe we don't have to fully parse the mangled symbol. If we can get away with just parsing the "?symbolname@@" at the beginning of the symbol, and ignore all the type-related stuff, I guess that would be okay.

Alternatively, I guess we could use attributes to communicate the different mangled forms to the backend, but probably better to avoid that if we can.

If we can solve the mangling issues, I guess generating the alias in arm64eccalllowering would be fine.
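
The prefix-only idea discussed above could look roughly like the following sketch. It assumes that inserting "$$h" right after the first "@@" is sufficient, which the thread has not confirmed for all C++ names (nested scopes and templates in particular), and arm64ECNativeName is a hypothetical helper name, not an existing LLVM function.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Derive the Arm64EC native symbol name from the default symbol name.
// C names get a "#" prefix; MSVC C++ manglings (which start with "?") get
// "$$h" inserted after the first "@@". No type information is parsed.
std::string arm64ECNativeName(std::string Name) {
  assert(!Name.empty() && "symbol names are expected to be non-empty");
  if (Name[0] == '?') {
    std::size_t Pos = Name.find("@@");
    if (Pos != std::string::npos)
      Name.insert(Pos + 2, "$$h");
    // If no "@@" is found the name is returned unchanged; a real
    // implementation would need to decide how to handle that case.
  } else {
    Name.insert(0, "#");
  }
  return Name;
}
```

For example, under these assumptions "fA" becomes "#fA" and "?f@@YAHH@Z" becomes "?f@@$$hYAHH@Z".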

bcl5980 added a comment.EditedSep 22 2022, 4:18 PM

Sometimes we will emit the alias here but later the function will be inlined or eliminated by DCE.

If the alias is externally visible, it can't be eliminated; the compiler can't tell whether the symbol is referenced. If the alias isn't externally visible, it's dead from the outset. Not sure how this could become an issue.

There will be no functional issue here. I mean that we can avoid generating some redundant aliases if this is done in arm64eccalllowering.

And later we need to emit alias for direct call thunk also, like $originname$exitthunk.

Direct call thunks aren't directly relevant here; we only emit them for declarations, not definitions. I guess this does imply that we need to teach arm64eccalllowering how to modify mangled symbol names... and we could use that same code to insert the $$h.

Put all of them into arm64eccalllowering pass should be better I think.

I really don't want to do demangling in arm64eccalllowering. But looking at the generated patterns a bit more closely, maybe we don't have to fully parse the mangled symbol. If we can get away with just parsing the "?symbolname@@" at the beginning of the symbol, and ignore all the type-related stuff, I guess that would be okay.

Alternatively, I guess we could use attributes to communicate the different mangled forms to the backend, but probably better to avoid that if we can.

If we can solve the mangling issues, I guess generating the alias in arm64eccalllowering would be fine.

As far as I know, there are three kinds of alias we need to generate. For example:

extern "C" void function_name(void a)
    arm64 signature: #function_name(native)

If it is a function definition, we need to create an alias for the x86 signature and mangle a new name for the arm64ec signature. That is what this change does.

x86 signature: function_name

If it is a direct function call, we need to create two aliases:

function thunk: #function_name$exit_thunk
x86 signature: function_name

I don't really understand why we need to demangle the function name in arm64eccalllowering. It looks like all we need to do is generate the arm64 signature name from the default symbol name, which is the x86 signature.
I'm not familiar with the normal mangling rules. If # and $$h are unique, maybe we can just insert # at the beginning of a C symbol name, or insert $$h after the first `@@` of a C++ symbol name. Like:

if (!MangleName.empty() && MangleName[0] == '?') {
  size_t InsertIdx = MangleName.find("@@");
  if (InsertIdx != std::string::npos)
    MangleName.insert(InsertIdx + 2, "$$h");
} else {
  MangleName.insert(0, "#");