Page MenuHomePhabricator

[X86] Lower calls with clang.arc.attachedcall bundle
ClosedPublic

Authored by fhahn on Jan 13 2021, 6:11 AM.

Details

Summary

This patch adds support for lowering function calls with the
clang.arc.attachedcall bundle. The goal is to expand such calls to the
following sequence of instructions:

callq   @fn
movq  %rax, %rdi
callq   _objc_retainAutoreleasedReturnValue / _objc_unsafeClaimAutoreleasedReturnValue

This sequence of instructions triggers Objective-C runtime optimizations,
hence we want to ensure no instructions get moved in between them.
This patch achieves that by adding a new CALL_RVMARKER ISD node,
which gets turned into the CALL64_RVMARKER pseudo, which eventually gets
expanded into the sequence mentioned above.

The ObjC runtime function to call is determined by the
argument in the bundle, which is passed through as a
target constant to the pseudo.

@ahatanak is working on using this attribute in the front- & middle-end.

Together with the front- & middle-end changes, this should address
PR31925 for X86.

This is the X86 version of 46bc40e50246c1902a1ca7916c8286cb837643ee,
which added similar support for AArch64.

Diff Detail

Event Timeline

fhahn created this revision.Jan 13 2021, 6:11 AM
fhahn requested review of this revision.Jan 13 2021, 6:11 AM
Herald added a project: Restricted Project. · View Herald TranscriptJan 13 2021, 6:11 AM
Herald added a subscriber: wdng. · View Herald Transcript
fhahn added a comment.Jan 13 2021, 6:20 AM

I am not too familiar with how calls are handled in the X86 backend. I would really appreciate any pointers to cases that this approach might miss.

This is only relevant for Objective-C code on X86, so it would probably be OK to drop support for the CALL32* variants?

RKSimon resigned from this revision.Jan 14 2021, 2:06 AM
RKSimon added a reviewer: pengfei.
skan added a subscriber: skan.Jan 18 2021, 6:07 PM
fhahn planned changes to this revision.Jan 21 2021, 1:28 AM

After checking with @ahatanak, we need a slightly different approach on X86 compared to ARM64. I'll update the patch shortly.

fhahn updated this revision to Diff 322138.Feb 8 2021, 9:17 AM

Updated to work with operand bundles as available upstream, emit marker ( movq %rdi, %rax) followed by call to either _objc_retainAutoreleasedReturnValue or _objc_unsafeClaimAutoreleasedReturnValue.

fhahn updated this revision to Diff 322154.Feb 8 2021, 10:08 AM

Fix marker, remove 32bit support, as we will only enable it for x86-64 in the frontend initially.

fhahn retitled this revision from [X86] Lower calls with rv_marker attribute. to [X86] Lower calls with clang.arc.rv bundle.Feb 9 2021, 5:24 AM
fhahn edited the summary of this revision. (Show Details)
fhahn updated this revision to Diff 322992.Feb 11 2021, 6:58 AM

Fix issue with pcrel32 calls, properly update RAX reg state (if it is implicit-def dead on the original call, we need to change it to implicit-def, because we add a new use; also set the state on the generated call); add reg mask for new call.

fhahn updated this revision to Diff 323937.Feb 16 2021, 2:50 AM

Update bundle to clang.arc.attachedcall, in line with committed bundle name.

fhahn retitled this revision from [X86] Lower calls with clang.arc.rv bundle to [X86] Lower calls with clang.arc.attachedcall bundle.Feb 16 2021, 2:50 AM
fhahn edited the summary of this revision. (Show Details)
fhahn updated this revision to Diff 329696.Mar 10 2021, 9:48 AM

Rebase & ping :)

I also added a new test case to make sure implicit operands for call arguments are added correctly to the generated calls.

ab added a comment.Mar 17 2021, 1:33 PM

Can you (or @ahatanak) explain why/how this needs to be different from aarch64? In particular I don't think I understand why you need to emit the retainARV/claimARV calls this late, that seems like quite a drastic difference in behavior for the bundle.
I read through the discussion in D92808 but it sounds like that predates the x86_64-specific behavior, right? But skimming around the various passes involved I'm not actually sure who emits the retainARV/claimARV calls on aarch64.

llvm/lib/Target/X86/X86ISelLowering.cpp
4435

clang.arc.rv -> clang.arc.attachedcall, right? same below

4443

Sel is a bit confusing, especially in an objc context ;)

fhahn added a comment.Mar 17 2021, 1:59 PM

Thanks for taking a look!

In D94597#2632806, @ab wrote:

Can you (or @ahatanak) explain why/how this needs to be different from aarch64? In particular I don't think I understand why you need to emit the retainARV/claimARV calls this late, that seems like quite a drastic difference in behavior for the bundle.

Unfortunately the ObjC runtime matches different sequences on AArc64 and X86. This is documented here https://github.com/opensource-apple/objc4/blob/master/runtime/objc-object.h#L974

On AArch64, the callee only checks for mov x29, x29 in the caller and the position of the claim/autorelease runtime call does not matter, which makes things much simpler. On X86_64, it specifically looks for the marker instructions, immediately followed by the runtime call. So we want to make sure that it is very hard to break up this specific sequence, to avoid missing out on the runtime optimization.

I read through the discussion in D92808 but it sounds like that predates the x86_64-specific behavior, right? But skimming around the various passes involved I'm not actually sure who emits the retainARV/claimARV calls on aarch64.

On AArch64 I think that happens during the lowering of the ARC intrinsics pre-ISel, but @ahatanak would know more about the exact place.

fhahn updated this revision to Diff 332299.Mar 22 2021, 8:34 AM

Update references to arc bundle name, change Sel to RuntimeCallType.

In D94597#2632806, @ab wrote:

I read through the discussion in D92808 but it sounds like that predates the x86_64-specific behavior, right? But skimming around the various passes involved I'm not actually sure who emits the retainARV/claimARV calls on aarch64.

On AArch64 I think that happens during the lowering of the ARC intrinsics pre-ISel, but @ahatanak would know more about the exact place.

ARC contract pass inserts the retainRV/claimRV calls before transforming the function. After the transformation is done, the pass removes the inserted runtime calls on x86-64, but leaves the calls in the IR on AArch64.

On AArch64, objc_autoreleaseReturnValue, which the callee calls before returning, looks for only the marker instruction, so it's not necessary for the retainRV/claimRV instructions to immediately follow the function call and the marker instruction. On x86-64, it looks for a movq first and then a retainRV/claimRV call immediately following the movq.

See callerAcceptsOptimizedReturn in the code below:

https://opensource.apple.com/source/objc4/objc4-818.2/runtime/objc-object.h.auto.html

ab added a comment.EditedApr 27 2021, 10:42 AM

Unfortunately the ObjC runtime matches different sequences on AArc64 and X86. This is documented here https://github.com/opensource-apple/objc4/blob/master/runtime/objc-object.h#L974

On AArch64, the callee only checks for mov x29, x29 in the caller and the position of the claim/autorelease runtime call does not matter, which makes things much simpler. On X86_64, it specifically looks for the marker instructions, immediately followed by the runtime call. So we want to make sure that it is very hard to break up this specific sequence, to avoid missing out on the runtime optimization.

Interesting, I didn't know that! I guess the mov rax/rdi was a natural part of the sequence on x86_64, and the old original runtime check was opportunistically based on the existing codegen.

So, looking at the patch, this looks correct, but I'm still mildly uncomfortable with the sel bit surviving that late. Here's a maybe-terrible idea: would it work if you replaced the i64 0/1 in the bundle with the actual pointer to the called function? e.g.,

%r = call i8* @foo(i64 %c, i64 %b, i64 %a) [ "clang.arc.attachedcall"(i8*(i8*)* @_objc_retainAutoreleasedReturnValue) ]

on aarch64 you just do the bundle with no args (though "attachedcall" may not be the best name at that point); on x86 you make the frontend pass the actual function it wants (edit: though it sounds like the ARC passes would need more of that logic than the frontend?), and the backend doesn't care and just turns the GV into a target globaladdr, and turns it into a call later

fhahn added a comment.Apr 30 2021, 7:09 AM
In D94597#2720350, @ab wrote:

Unfortunately the ObjC runtime matches different sequences on AArc64 and X86. This is documented here https://github.com/opensource-apple/objc4/blob/master/runtime/objc-object.h#L974

On AArch64, the callee only checks for mov x29, x29 in the caller and the position of the claim/autorelease runtime call does not matter, which makes things much simpler. On X86_64, it specifically looks for the marker instructions, immediately followed by the runtime call. So we want to make sure that it is very hard to break up this specific sequence, to avoid missing out on the runtime optimization.

Interesting, I didn't know that! I guess the mov rax/rdi was a natural part of the sequence on x86_64, and the old original runtime check was opportunistically based on the existing codegen.

So, looking at the patch, this looks correct, but I'm still mildly uncomfortable with the sel bit surviving that late. Here's a maybe-terrible idea: would it work if you replaced the i64 0/1 in the bundle with the actual pointer to the called function? e.g.,

%r = call i8* @foo(i64 %c, i64 %b, i64 %a) [ "clang.arc.attachedcall"(i8*(i8*)* @_objc_retainAutoreleasedReturnValue) ]

on aarch64 you just do the bundle with no args (though "attachedcall" may not be the best name at that point); on x86 you make the frontend pass the actual function it wants (edit: though it sounds like the ARC passes would need more of that logic than the frontend?), and the backend doesn't care and just turns the GV into a target globaladdr, and turns it into a call later

Yeah I think that would be a good option. @ahatanak WDYT?

Yes, I think that's better. The verifier currently rejects that IR since taking addresses of intrinsic functions isn't allowed (see https://reviews.llvm.org/D92808#2558928), but probably that can be fixed.

ab accepted this revision.May 11 2021, 10:12 AM

Ah sorry, I didn't see you tried that already! I don't think we want to weaken the verifier check, this is probably the only case where it makes sense to take the address. I guess we've meandered through the same decision process, thanks for the patience ;) But it might be worth explaining these decisions in the commit message. Also in the langref paragraph for the bundle, it says the call is always emitted for the bundle.
Which brings another idea: how about emitting the call when lowering the bundle on AArch64 as well? There's not much room for better regalloc or other optimizations in the sequence anyway, right?

But back to x86: there's a couple things we could do to make the behavior more obvious (split the bundle into distinct retain/claim ones? replace the sel imm MO in these pseudos with an actual GV/ES reference to the symbol?). But I'm not sure that makes a real difference, so LGTM as-is

llvm/lib/Target/X86/X86ExpandPseudo.cpp
245

why "Cache"?

llvm/test/CodeGen/X86/call-rv-marker.ll
161

Hmm, I'm curious, do you know why this doesn't trigger? Is it just missing entries in the fold tables? (does the extra operand make that trickier than it should be?)

This revision is now accepted and ready to land.May 11 2021, 10:12 AM
fhahn added a comment.Fri, May 21, 8:31 AM

Thanks Ahmed! I'll address your comment and then I'll land it. I expect that we will probably be making a couple of improvements once this lands and gets wider test coverage

llvm/lib/Target/X86/X86ExpandPseudo.cpp
245

I'm not sure any more, I'll rename it to Fn :)

llvm/test/CodeGen/X86/call-rv-marker.ll
161

Yes I think there's an entry missing to optimize this. I'll investigate that separately.

This revision was landed with ongoing or failed builds.Fri, May 21, 8:34 AM
This revision was automatically updated to reflect the committed changes.