This is an archive of the discontinued LLVM Phabricator instance.

[X86] pr51000 in-register struct return tailcalling
ClosedPublic

Authored by urnathan on Jul 12 2021, 4:53 AM.

Details

Summary

This addresses pr51000 -- a failure to tailcall an object factory function. The underlying issue is tailcalling to or from functions that return a structure.

The x86 backend -- in common with many others -- disables tailcalling when either the caller or callee return a struct (regardless of whether the struct is in registers or via additional out parameter). I suspect this is from when the middle end was less smart, and didn't handle the return object lifetime well. However, AFAICT, it is now smarter, and its tailcallable optimizations do account for that object's lifetime -- just like any other local object or pointer parameter. Plus gone are the days of ABIs that returned structs in globals! (at least I hope so). However, I'm not sure whether there are x86 ABIs that return structs via a pointer in an abn
ormal register that the middle end doesn't know about? I'm not sure how such an implementation would be representable without actually modeling at least a proxy for that register in the middle end, as otherwise it wouldn't know about liveness or escaping of the object to which that register ends up pointing?

The x86 backend only tries to tailcall fns the middle end has marked as tail calls, and then it checks for additional constraints like not being able to tail call through the PLT, or variadic fns in all cases.

So, just removing the struct restriction appears to be sufficient. Obviously this affects the existing sibcall.ll test, and I added a new C++ testcase to check that we generate the expected code in the 51000 cases, and some closely related cases (and do not tail call when the callee returns a large struct into the caller's stack frame).

I bootstrapped an x86_64-linux release compiler and observed that the stage 2 and stage 3 compilers have exactly the same size text, data and bss. Both compilers validate, of course.

pr51000 hypothesizes the issue is arch independent -- that is incorrect. Each code generator that manifests this issue has the same restriction. I've only looked at x86 for the moment.

Diff Detail

Unit TestsFailed

Event Timeline

urnathan created this revision.Jul 12 2021, 4:53 AM
urnathan requested review of this revision.Jul 12 2021, 4:53 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 12 2021, 4:53 AM

Thank you so much for this!

Quick question, when the backend disables tail-calling, how does that integrate with the new [[clang::musttail]] attribute?
Do we correctly reject the program in that case, or are we silently dropping the tail call?

Quick question, when the backend disables tail-calling, how does that integrate with the new [[clang::musttail]] attribute?
Do we correctly reject the program in that case, or are we silently dropping the tail call?

that logic looks, at best, suspicious -- but I've not altered it.
if (isTailCall && !IsMustTail) { ... check arch-specific constraints maybe clear isTailCall ... }
if (IsMustTail && !isTailCall) { .. error ..}
As you can see, when 'musttail' is in play we never check that ISA can actually do it.
I don't know the semantics of musttail though -- is it 'I, the programmer, assert this is ok to tailcall', or 'I, the programmer, expect a correct tail call here'.
If it is the former, the logic is sound, if it is the latter, it is not. But I think that's an orthogonal issue.

that logic looks, at best, suspicious -- but I've not altered it.
if (isTailCall && !IsMustTail) { ... check arch-specific constraints maybe clear isTailCall ... }
if (IsMustTail && !isTailCall) { .. error ..}
As you can see, when 'musttail' is in play we never check that ISA can actually do it.
I don't know the semantics of musttail though -- is it 'I, the programmer, assert this is ok to tailcall', or 'I, the programmer, expect a correct tail call here'.
If it is the former, the logic is sound, if it is the latter, it is not. But I think that's an orthogonal issue.

It definitely means there *must* be a tail call here, otherwise we will use unbounded amount of stack :)
Ping @haberman, who worked on the DR which implemented this: https://reviews.llvm.org/D99517

I know it's orthogonal, but when I saw your patch it immediately clicked me that there is something off here.

Apologies if this is more of a LLVM musttail issue rather than clang's, and if this was already known limitation. But I feel that it is dangerous to have this attribute but then not be able to enforce it...

Thanks!

When I implemented [[clang::musttail]], I wasn't aware that there were cases like this, where musttail in the IR would pass IR verification, but then a tail call would fail to generate in the backend. The IR documentation for musttail doesn't warn about this case.

Since implementing the attribute, I've become aware of other examples of this problem. For example, PowerPC fails to generate tail calls in several cases, throwing an error like:

fatal error: error in backend: failed to perform tail call elimination on a call site marked musttail

It definitely means there *must* be a tail call here, otherwise we will use unbounded amount of stack :)

I agree with this, [[clang::musttail]] is supposed to provide a guarantee.

I'm not sure the right way to proceed here. Clang should issue a diagnostic and fail if a tail call is not possible. But only the backend knows authoritatively if a tail call is possible or not.

Do we need a way to query the backend? eg. expose IsEligibleForTailCallOptimization() or similar to front-ends like Clang?

urnathan added a comment.EditedJul 12 2021, 9:28 AM

Do we need a way to query the backend? eg. expose IsEligibleForTailCallOptimization() or similar to front-ends like Clang?

In general that's not possible until we've done code generation. (a) the middle end is responsible for determing a bunch of 'can tail-call' pieces (like local variable liveness at the point of the call). The backend is responsible for determining other items (like 'this is a cross-SO call'. Heck, code generation might not be being done in the same process as the compiler invocation -- LTO is a thing :).

I think there are two types of tail-call failure.
a) middle-end says no -- eg, a local stack-resident variable is live across the call.
b) the back-end says no -- eg, we have a cross-SO call here (and the got-pointer is callee-save).

ETA: darn, forgot to mention, the sibcall optimization was the first optimization I ever implemented in a compiler!

re: [[clang::musttail]]: if the C++-level types match, musttail should always succeed. If it doesn't, that's generally a backend bug. Note that the [[clang::musttail]] rules imply the sret markings match. (The old PowerPC ABI has complicated issues with tail calls, though...)

llvm/test/CodeGen/X86/sibcall.ll
986

Are you sure it's legal to transform t21_sret_to_non_sret?

From the x86_64 ABI document: "If the type has class MEMORY [...] On return %rax will contain the address that has been passed in by the caller in %rdi." t21_f_non_sret won't do that.

a) middle-end says no -- eg, a local stack-resident variable is live across the call.

What is an example of this? The musttail attribute requires that it is immediately followed by a ret, so I can't think of any way to trigger this case that would pass IR verification.

The Clang code for musttail already detects and rejects the case where a variable with a non-trivial destructor is in scope at the call site: https://github.com/llvm/llvm-project/blob/072669521456a369409cf9db30739a3fac740173/clang/test/SemaCXX/attr-musttail.cpp#L58-L62

b) the back-end says no -- eg, we have a cross-SO call here (and the got-pointer is callee-save).

This is the case I am worried about. I don't know how we detect this in the front-end, since it's arch-dependent whether cross-SO calls will work.

efriedma added inline comments.Jul 12 2021, 8:59 PM
llvm/test/CodeGen/X86/sibcall.ll
979

Oh, now I remember what happened the last time someone tried to fix this... t21_sret_to_sret_structs_mismatch() is also illegal to transform, for similar reasons to t21_sret_to_non_sret().

urnathan added inline comments.Jul 13 2021, 7:01 AM
llvm/test/CodeGen/X86/sibcall.ll
979

Ah, I think I understand now -- I was thinking this would be UB, but I can see how it might be legit. Let me walk though my understanding.

define fastcc void @t21_sret_to_sret_structs_mismatch(%struct.foo* noalias sret(%struct.foo) %agg.result, %struct.foo* noalias %a) nounwind {
we receive an sret as %agg.result and a explicit pointer in %a (to uninitialized memory? or is that too C++-specific?).

%b = call fastcc %struct.foo* @ret_struct()
this gives us a pointer to a foo in %b [aside, this seems superfluous to the xform being tested?]

tail call fastcc void @t21_f_sret2(%struct.foo* noalias sret(%struct.foo) %a, %struct.foo* noalias %b) nounwind
We call t21_f_sret2 passing it %a as its incoming sret pointer and %b as an explicit arg. It'll return %a, not our incoming sret.

I thought this would be UB, because we're not initializing our own return value -- in C++ we should be calling a constructor. but (a) we could have inlined that and (b) it could be trivial and do nothing. That was my logical error.

So, I think
a) if we're not an sret function we can tail call[*],
b) else if we're calling a non-sret function we cannot tail call,
c) else if we're passing our sret register as the callee's sret register we can tail call[*],
d) else we cannot.

  • provided the remaining checks are ok
efriedma added inline comments.Jul 13 2021, 11:28 AM
llvm/test/CodeGen/X86/sibcall.ll
979

That set of rules looks right.

For the testcase, a C equivalent would be something like this:

struct A { int x[10]; };
__attribute((noinline)) struct A t21_f_sret2(struct A *__restrict x) {
  return (struct A){0};
}
struct A t21_sret_to_sret_structs_mismatch(struct A a[__restrict static 1]) {
  struct A r = {};
  *a = t21_f_sret2(0);
  return r;
}

If you run this through clang -O2, you get IR almost identical to this testcase. (The result is missing a "tail" marker, but that's just a missed optimization; we're missing a run of tailcallelim in the normal pass pipeline.)

urnathan updated this revision to Diff 359019.Jul 15 2021, 9:45 AM

Here is an updated patch with a simpler goal. It only prevents sibcalling when the caller is an sret pointer function. To enable that to sibcall we need to trace the data flow to the callee's sret parameter, and that doesn't appear easy -- at least as a first foray into llvm's backend. So let's just have this smaller improvement, which does solve the problem the original bug report had.

efriedma added inline comments.Jul 15 2021, 1:15 PM
llvm/lib/Target/X86/X86ISelLowering.cpp
4848

I feel like I must be blind here, but this is what we used to call isCalleeStructRet, right?
I thought we wanted to check isCallerStructRet?

urnathan added inline comments.Jul 15 2021, 1:30 PM
llvm/lib/Target/X86/X86ISelLowering.cpp
4848

Not quite. callIsStructReturn's result is tri-valued. Previously we were checking != NotStructReturn. It is sufficient to check == StackStructReturn, as RegStructReturn is tailcallable. I moved the call to callIsStructReturn here, from the caller, because its the only use of that value.

The test llvm/test/CodeGen/X86/sibcall.ll doesn't pass with the current version of the patch; please update.

urnathan abandoned this revision.Jul 16 2021, 7:59 AM

I need a vacation

urnathan updated this revision to Diff 367214.Aug 18 2021, 8:13 AM
urnathan retitled this revision from [X86] pr51000 struct return tailcalling to [X86] pr51000 in-register struct return tailcalling.

Ok, having had a nice long break here's an updated patch. It addresses the case of tail calling a function that returns a struct in registers. These are just like tail calling functions that return scalars.

Is it acceptable to have the FIXME comment pointing at pr51000 about the more complicated case when return is via an sret pointer parameter?

rnk added a subscriber: rnk.Aug 19 2021, 1:56 PM

There's a long discussion involving musttail above that I'm going to ignore, since it doesn't seem to have anything to do with PR51000.

Ok, having had a nice long break here's an updated patch. It addresses the case of tail calling a function that returns a struct in registers. These are just like tail calling functions that return scalars.

I agree, I think the code change is functionally correct: we want to power-down TCO whenever the calling function has to return something in RAX. For now, you're conservatively assuming the callee won't do that, and that's fine.

Is it acceptable to have the FIXME comment pointing at pr51000 about the more complicated case when return is via an sret pointer parameter?

I can't say these are normative conventions, but my experience is that LLVM generally avoids referencing bug numbers from comments. There are plenty of standard references and links to things like docs and object file standards, but otherwise, the idea is that the code comments should have everything you need to reason about the behavior, and the reader shouldn't have to refer to a bug to understand why the code is a certain way.

llvm/lib/Target/X86/X86ISelLowering.cpp
4848

I think the new way of writing this is clearer: the constraint we're really looking for is, do we need to return the sret parameter at the end of the function? If yes, block TCO. If not, do it.

4853

My sense is that the link isn't actually that helpful. I think the comments you wrote are reasonable straightforward. At least, I got it, but maybe I'm special. :)

urnathan updated this revision to Diff 367793.Aug 20 2021, 8:15 AM
urnathan marked 2 inline comments as done.

this removes the pr51000 comment and fixes a test failure on windows -- need to accept @PLT suffixes, hopefully got the regex right for making that optional.

Thanks for your feedback.

xgupta removed a subscriber: xgupta.Aug 20 2021, 8:19 AM
rnk added a subscriber: xbolva00.Aug 23 2021, 3:48 PM

Looks like there's a correctness issue on 32-bit x86.

clang/test/CodeGenCXX/pr51000.cpp
2

This is an integration test between Clang and LLVM, which are generally discouraged. Your code change is to the x86 backend, so the LLVM CodeGen/X86 test should be sufficient. I was unable to find any documentation to point to support this discouragement. >_>

llvm/test/CodeGen/X86/sibcall.ll
661

I dug into the history, and it seems @xbolva00 added this test in a previous attempt to do more TCO around sret. The test is from rG642ed40e57faa34ee3ae7ba0e32f75ff582d408b and the attempt was D46262.

Honestly, I think you can remove the test. I assumed it had some purpose, but I don't see any motivation for this UB in the review comments.

1026–1027

Hang on, this seems wrong. The stack adjustments aren't balanced.

I think we can't do TCO for 32-bit x86. It pops off the sret pointer. See this IR:

define ccc void @t22_f_sret(i32* sret(i32) %p) {
        store i32 0, i32* %p
        ret void
}
define ccc void @t22_non_sret_to_sret(i32* %agg.result) nounwind  {
        tail call ccc void @t22_f_sret(i32* noalias sret(i32) %agg.result) nounwind
        ret void
}

->
$ llc t.ll -o - -mtriple=i686-linux-gnu

t22_f_sret:                             # @t22_f_sret
        .cfi_startproc
# %bb.0:
        movl    4(%esp), %eax
        movl    $0, (%eax)
        retl    $4
.Lfunc_end0:
...
t22_non_sret_to_sret:                   # @t22_non_sret_to_sret
# %bb.0:
        subl    $12, %esp
        movl    16(%esp), %eax
        movl    %eax, (%esp)
        calll   t22_f_sret@PLT
        addl    $8, %esp
        retl

The retl $4 instruction here is key, it adjusts ESP to pop off extra argument memory.

urnathan updated this revision to Diff 368329.Aug 24 2021, 6:32 AM
  • Removed pr51000.cpp testcase
  • Prevent tailcall on i686 from non-sret to sret
  • Removed UB sibcall.ll test

AFAICT by the time we get to this point of code generation, in-register struct return has been flattened to explicit tuple types. So this is mostly already handled by existing scalar/multiple-value sibcall machinery and testing.

The StructReturnType enum's RegStructReturn enumerator is never checked -- so that bit of API could be cleaned up to just be a 'is-sret-pointer' boolean. I also notice that {callIs,argsAre}StructReturn only look at the first result/parm, whereas formal parameter lowering iterates to the first one that has isSret set, and one of the sibcall tests passes a non-initial sret parameter. That seems inconsistent (more UB?)

urnathan marked 3 inline comments as done.Aug 24 2021, 6:35 AM
urnathan added inline comments.
llvm/test/CodeGen/X86/sibcall.ll
1026–1027

oh, I didn't realize sret was callee pop in that case. There is later code in the tailcall checking, concerning caller and callee pop matching, but it doesn't seem prepared to deal with partial popping. Simpler just to explicitly check for i686.

xbolva00 added inline comments.Aug 24 2021, 6:43 AM
llvm/test/CodeGen/X86/sibcall.ll
661

Feel free to delete it

rnk accepted this revision.Aug 24 2021, 2:08 PM

lgtm

Thanks!

This revision is now accepted and ready to land.Aug 24 2021, 2:08 PM
This revision was landed with ongoing or failed builds.Aug 25 2021, 10:16 AM
This revision was automatically updated to reflect the committed changes.
urnathan marked an inline comment as done.