This is an archive of the discontinued LLVM Phabricator instance.

[X86][VARARG] Avoid spilling xmm vararg arguments.
AbandonedPublic

Authored by avl on Oct 24 2019, 3:08 AM.

Details

Summary

That patch fixes https://bugs.llvm.org/show_bug.cgi?id=42219 bug.
Related review D62639.

For the noimplicitfloat mode, the compiler mustn't generate
floating-point code if it was not asked directly to do so.
This rule does not work with variable function arguments currently.
Though compiler correctly guards block of code, which copies xmm vararg parameters with a check for %al,
it does not protect spills for xmm registers. Thus, such spills generated in non-protected areas and could break code, which does not expect floating-point data.
The problem happens in -O0 optimization mode. With this optimization level
there is used FastRegisterAllocator, which spills virtual registers at basic block boundaries.
Register Allocator does not protect spills with additional control-flow modifications.
Thus to resolve that problem, it is suggested to not copy incoming physical
registers into virtual registers. Instead, store incoming physical xmm registers into the memory from scratch.

Another variant of this problem happens with high optimization modes with thunk functions.
At a high optimization level, the Greedy Register Allocator is used.
For the following test case(CodeGen/X86/musttail-varargs.ll) :

define void @f_thunk(i8* %this, ...) {
  ; Use va_start so that we exercise the combination.  %ap = alloca [4 x i8*], align 16
  %ap_i8 = bitcast [4 x i8*]* %ap to i8*
  call void @llvm.va_start(i8* %ap_i8)

  %fptr = call void(i8*, ...)*(i8*) @get_f(i8* %this)  <<<<<<<<<<<<<<<<<<<
  musttail call void (i8*, ...) %fptr(i8* %this, ...)
  ret void
}

f_thunk function needs to propagate all their parameters into %fptr. But, it needs to store/restore virtual registers
corresponding to incoming xmm registers around the call to get_f(). So the final code contains unprotected stores/restores for xmm registers. The solution for that case is also to avoid using virtual registers. Copy incoming physical xmm registers into the memory at the function entry. Restore them from memory right before tail-call jump instruction. New asm code for this case would look like this:

f_thunk: 
        testb   %al, %al
        je      .LBB0_2
# %bb.1:
        movaps  %xmm0, 96(%rsp)  <<< store incoming xmm registers on to stack

.LBB0_2:
        callq   get_f
        testb   %al, %al   <<< check for existence of xmm varargs parameters
        je      .LBB0_4
# %bb.3:
        movaps  96(%rsp), %xmm0        <<< restore xmm varargs parameters before tailcall jump. 
        jmpq    *%r11                   # TAILCALL
.LBB0_4:
        jmpq    *%r11                   # TAILCALL

Diff Detail

Event Timeline

avl created this revision.Oct 24 2019, 3:08 AM
Herald added a project: Restricted Project. · View Herald TranscriptOct 24 2019, 3:08 AM
Herald added subscribers: jfb, hiraditya. · View Herald Transcript
avl edited the summary of this revision. (Show Details)Oct 24 2019, 3:11 AM
avl edited the summary of this revision. (Show Details)Oct 24 2019, 3:13 AM
asl added inline comments.Oct 24 2019, 8:41 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
3699

What is the status of all these TODO here and there?

llvm/lib/Target/X86/X86ISelLowering.h
264

Do you really need these changes?

avl marked 2 inline comments as done.Oct 24 2019, 9:51 AM
avl added inline comments.
llvm/lib/Target/X86/X86ISelLowering.cpp
3699

I assume they would be done later. They are not required for correctness.

  1. not storing xmm registers if "no fp, only musttail calls, noimplcitfloat". I am going to do after this patch and D62639 would be integrated.
  2. support of YMM and ZMM would be done by someone who would implement support of YMM and ZMM registers for varargs. this is separate task.
  3. comment for AArch64 - for someone who would implement that on AArch64.
llvm/lib/Target/X86/X86ISelLowering.h
264

no, I don`t . it was done by clang-format.

asl added inline comments.Oct 24 2019, 9:59 AM
llvm/lib/Target/X86/X86ISelLowering.h
264

Please do not add unrelated changes

avl marked an inline comment as done.Oct 24 2019, 10:13 AM
avl added inline comments.
llvm/lib/Target/X86/X86ISelLowering.h
264

Ok, I will delete them. clang-format added above changes since VARARG_THUNK_SAVE_XMM_REGS was added. In that sense the changes are related. But OK, I will delete them.

avl updated this revision to Diff 227827.Nov 5 2019, 1:38 AM

removed formatting done by clang-format for X86ISelLowering.h. ping...

avl updated this revision to Diff 227838.Nov 5 2019, 3:10 AM

fix small typo introduced by previous update.

Is it possible to avoid expanding VARARG_THUNK_SAVE_XMM_REGS until after register allocation? I would rather not make the liveness rules more complicated just for the sake of working around a limitation of fast regalloc.

avl added a comment.Nov 7 2019, 2:03 AM

Is it possible to avoid expanding VARARG_THUNK_SAVE_XMM_REGS until after register allocation? I would rather not make the liveness rules more complicated just for the sake of working around a limitation of fast regalloc.

yes, it is. Would do that way. Thanks.

avl updated this revision to Diff 228669.Nov 11 2019, 4:50 AM

addressed comments(moved expansion of xmm registers code until after register allocator).

avl updated this revision to Diff 234013.Dec 16 2019, 3:04 AM

rebased. ping.

avl added a comment.Jan 6 2020, 7:30 AM

ping.

Colleagues, any comments on this review?

efriedma edited reviewers, added: rnk; removed: efriedma.Jan 13 2020, 1:07 PM

Target-independent changes look fine.

I'm not planning to review the x86 backend changes.

craig.topper added inline comments.Jan 13 2020, 2:45 PM
llvm/lib/Target/X86/X86ISelLowering.cpp
3543

First letter of variables should be capitalized.

3692

!guardedXmmRegs.empty()

rnk requested changes to this revision.Jan 13 2020, 2:50 PM

For the noimplicitfloat mode, the compiler mustn't generate
floating-point code if it was not asked directly to do so.

I don't think checking AL on function entry will work reliably. In general, musttail is meant to forward all possible register parameters. Normal arguments may be passed in XMM registers without setting AL. Consider this code:

typedef float v4f32 __attribute__((__vector_size__(16), __aligned__(16)));
int gv_i32;
v4f32 gv_v4f32;
void bar(int x, v4f32 v) {
  gv_i32 = x;
  gv_v4f32 = v;
}
void foo() {
  v4f32 v = {};
  bar(42, v);
}

It should be possible to use musttail thunk with a prototype of void (int, ...) between the call to foo and bar. And, if that thunk contains a function call, it will need to spill&fill all XMM argument registers. If you look at the generated assembly for calling bar, AL is not set to 1:

# %bb.0:                                # %entry
        pushq   %rax
        .cfi_def_cfa_offset 16
        xorps   %xmm0, %xmm0
        movl    $42, %edi
        callq   _Z3bariDv4_f

Is there actually a use case for musttail thunks and noimplicitfloat? Could we instead just declare that such a thunk represents explicit FP usage?

Furthermore, this will probably do the wrong thing on Windows, where AL is never set, to my knowledge. Consider this C++ code that uses musttail in the MS ABI:

int gv_i32;
double gv_f64;
class Foo {
  void virtual bar(int x, double v);
  void foo();
};
void Foo::bar(int x, double v) {
  gv_i32 = x;
  gv_f64 = v;
}
void Foo::foo() {
  auto mp = &Foo::bar;
  (this->*mp)(42, 0.0);
}

The virtual member pointer thunk needs to pass along XMM2, although it doesn't contain a function call.

llvm/lib/Target/X86/X86ExpandPseudo.cpp
250

This looks like a lot of very fragile pattern matching. I would greatly prefer it if we didn't have to do this.

llvm/lib/Target/X86/X86ISelLowering.cpp
3574

We don't in general reference the issue tracker from code comments, only from test cases.

This revision now requires changes to proceed.Jan 13 2020, 2:50 PM
avl added a comment.Jan 14 2020, 10:40 AM
In D69372#1818166, @rnk wrote:

For the noimplicitfloat mode, the compiler mustn't generate
floating-point code if it was not asked directly to do so.

I don't think checking AL on function entry will work reliably. In general, musttail is meant to forward all possible register parameters. Normal arguments may be passed in XMM registers without setting AL. Consider this code:

typedef float v4f32 __attribute__((__vector_size__(16), __aligned__(16)));
int gv_i32;
v4f32 gv_v4f32;
void bar(int x, v4f32 v) {
  gv_i32 = x;
  gv_v4f32 = v;
}
void foo() {
  v4f32 v = {};
  bar(42, v);
}

It should be possible to use musttail thunk with a prototype of void (int, ...) between the call to foo and bar.

if I correctly understood the example - such a situation should not occur.
My understanding is that musttail thunk and it`s target _must_ have identical signatures.

In the above example, thunk assumed this definition: void (int, ...) while the target has this definition: void (int x, v4f32 v), that is wrong.

There should be either:

thunk: void (int, ...)
target: void (int, ...)

either:

thunk: void (int x, v4f32 v)
target: void (int x, v4f32 v)

In both these cases, this patch would work correctly.
Thus, assuming identical declarations, checking AL should work correctly.

And, if that thunk contains a function call, it will need to spill&fill all XMM argument registers. If you look at the generated assembly for calling bar, AL is not set to 1:

# %bb.0:                                # %entry
        pushq   %rax
        .cfi_def_cfa_offset 16
        xorps   %xmm0, %xmm0
        movl    $42, %edi
        callq   _Z3bariDv4_f

Right. AL is not set to 1. That is so because bar() does not use variable arguments according to its declaration.
Thunk function should assume the same signature and do not use AL trick then.

Is there actually a use case for musttail thunks and noimplicitfloat? Could we instead just declare that such a thunk represents explicit FP usage?

I do not know the real examples. But in general, it looks like a valid use case(musttail thunks and noimplicitfloat).

Furthermore, this will probably do the wrong thing on Windows, where AL is never set, to my knowledge. Consider this C++ code that uses musttail in the MS ABI:

int gv_i32;
double gv_f64;
class Foo {
  void virtual bar(int x, double v);
  void foo();
};
void Foo::bar(int x, double v) {
  gv_i32 = x;
  gv_f64 = v;
}
void Foo::foo() {
  auto mp = &Foo::bar;
  (this->*mp)(42, 0.0);
}

The virtual member pointer thunk needs to pass along XMM2, although it doesn't contain a function call.

This patch does not work on windows . It is only for amd64 ABI:

bool X86CallLowering::lowerCall()
...

if (STI.is64Bit() && !IsFixed && !STI.isCallingConvWin64(Info.CallConv)) {
  // From AMD64 ABI document:
  // For calls that may call functions that use varargs or stdargs
  // (prototype-less calls or calls to functions containing ellipsis (...) in
  // the declaration) %al is used as hidden argument to specify the number
  // of SSE registers used. The contents of %al do not need to match exactly
  // the number of registers, but must be an ubound on the number of SSE
  // registers used and is in the range 0 - 8 inclusive.

  MIRBuilder.buildInstr(X86::MOV8ri)
      .addDef(X86::AL)
      .addImm(Handler.getNumXmmRegs());
  MIB.addUse(X86::AL, RegState::Implicit);
}

X86ISelLowering.cpp:
// FIXME: Get this from tablegen.
static ArrayRef<MCPhysReg> get64BitArgumentXMMs(MachineFunction &MF,

                                              CallingConv::ID CallConv,
                                              const X86Subtarget &Subtarget) {
assert(Subtarget.is64Bit());
if (Subtarget.isCallingConvWin64(CallConv)) {
  // The XMM registers which might contain var arg parameters are shadowed
  // in their paired GPR.  So we only need to save the GPR to their home
  // slots.
  // TODO: __vectorcall will change this.
  return None;
}
avl marked 3 inline comments as done.Jan 14 2020, 10:47 AM
avl added inline comments.
llvm/lib/Target/X86/X86ExpandPseudo.cpp
250

would rewrite it.

rnk added a comment.Jan 14 2020, 3:34 PM
In D69372#1820099, @avl wrote:
In D69372#1818166, @rnk wrote:

It should be possible to use musttail thunk with a prototype of void (int, ...) between the call to foo and bar.

if I correctly understood the example - such a situation should not occur.
My understanding is that musttail thunk and it`s target _must_ have identical signatures.

The third point under musttail is meant to create an exception for for perfectly forwarding thunks: https://llvm.org/docs/LangRef.html#id325

If the musttail call appears in a function with the "thunk" attribute and the caller and callee both have varargs, than any unprototyped arguments in register or memory are forwarded to the callee. Similarly, the return value of the callee is returned to the caller’s caller, even if a void return type is in use.

In retrospect, I think it may have been a mistake to repurpose varargs to indicate that all remaining argument registers should be preserved. Maybe we don't need to use a varargs function prototype to implement perfectly forwarding thunks. We ended up needing the "thunk" function attribute, so we could base it on that instead.

This patch does not work on windows . It is only for amd64 ABI:

OK, but the C++ code I provided illustrates the use case of perfect forwarding, which might come up on Linux. It sets up a chain of calls that looks like:

  • indirect call to thunk, pass FP in XMM ->
  • in thunk, save XMM, use XMM, restore XMM, tail call to callee ->
  • receive XMM value in callee

The use case for perfectly forwarding thunks is pretty rare. I don't think there is a way to convince clang to generate one for Linux, just Windows.


Anyway, it seems like you could do away with a fair amount of the complexity in this patch for handling guarding musttail forwarding by declaring them to be an "explicit FP use", since it is not, in general, possible for a perfectly forwarding thunk to know if XMM arguments have been used.

avl added a comment.Jan 15 2020, 10:07 AM
In D69372#1820845, @rnk wrote:
In D69372#1820099, @avl wrote:
In D69372#1818166, @rnk wrote:

It should be possible to use musttail thunk with a prototype of void (int, ...) between the call to foo and bar.

if I correctly understood the example - such a situation should not occur.
My understanding is that musttail thunk and it`s target _must_ have identical signatures.

The third point under musttail is meant to create an exception for for perfectly forwarding thunks: https://llvm.org/docs/LangRef.html#id325

If the musttail call appears in a function with the "thunk" attribute and the caller and callee both have varargs, than any unprototyped arguments in register or memory are forwarded to the callee. Similarly, the return value of the callee is returned to the caller’s caller, even if a void return type is in use.

In retrospect, I think it may have been a mistake to repurpose varargs to indicate that all remaining argument registers should be preserved. Maybe we don't need to use a varargs function prototype to implement perfectly forwarding thunks. We ended up needing the "thunk" function attribute, so we could base it on that instead.

It looks like point 3 does not allow to have thunk signature to be "void (int, ...)" and callee signature "void (int x, v4f32 v)".
That point means that if caller and callee both have varargs, all varargs arguments should be properly forwarded from caller to callee.

There is another rule which looks relevant:

"The callee must be varargs iff the caller is varargs. Bitcasting a non-varargs function to the appropriate varargs type is legal so long as the non-varargs prefixes obey the other rules."

i.e. if caller/thunk signature "void (int, ...)", then the callee signature could be "void (int x, v4f32 v)".
Caller while calls thunk should use "void (int, ...)" signature.
Caller must set AL because of thunk signature "void (int, ...)".

that scenario should work correctly with/without this patch.

if currently there is a use case when caller/thunk have this signature "void (int, ...)" and AL is not set by caller(for AMD64 ABI) - then it looks like a bug in implementation.

This patch does not work on windows . It is only for amd64 ABI:

OK, but the C++ code I provided illustrates the use case of perfect forwarding, which might come up on Linux. It sets up a chain of calls that looks like:

  • indirect call to thunk, pass FP in XMM ->
  • in thunk, save XMM, use XMM, restore XMM, tail call to callee ->
  • receive XMM value in callee

The use case for perfectly forwarding thunks is pretty rare. I don't think there is a way to convince clang to generate one for Linux, just Windows.

It seems to me that this case is already properly handled and my patch does not break it.
This example does not have varargs in function declarations.
Thus AL trick would not be used.

Both thunk and target have this signature : void (int x, double v).
It explicitly uses XMM(as per ABI for func params).
All following manipulations would be explicitly done:

  • indirect call to thunk, pass FP in XMM -> OK
  • in thunk, save XMM, use XMM, restore XMM, tail call to callee -> OK
  • receive XMM value in callee -> OK

What would start to be wrong with this patch?

avl added a comment.Jan 17 2020, 11:45 AM

@rnk

Anyway, it seems like you could do away with a fair amount of the complexity in this patch for handling guarding musttail forwarding by declaring them to be an "explicit FP use", since it is not, in general, possible for a perfectly forwarding thunk to know if XMM arguments have been used.

This effectively means that any program containing thunks could not be compiled for non-floating point environment even if it does not use fp at all.
Taking into account that thunks are usually inserted by the compiler - that could be unexpected and hard to solve for the user.
i.e. The program could not have floating point code at all but it would not be possible to compile it because of thunks.
I think it is not a good alternative.

In retrospect, I think it may have been a mistake to repurpose varargs to indicate that all remaining argument registers should be preserved. Maybe we don't need to use a varargs function prototype to implement perfectly forwarding thunks. We ended up needing the "thunk" function attribute, so we could base it on that instead.

I agree that it is a mistake to use varargs for perfectly forwarding thunks purposes. Since it does not conform to current behavior and documented rules:

  1. testcase from llvm/test/CodeGen/X86/musttail-varargs.ll:
declare void @llvm.va_start(i8*) nounwind

declare void(i8*, ...)* @get_f(i8* %this)

define void @f_thunk(i8* %this, ...) {
  %ap = alloca [4 x i8*], align 16
  %ap_i8 = bitcast [4 x i8*]* %ap to i8*
  call void @llvm.va_start(i8* %ap_i8)

  %fptr = call void(i8*, ...)*(i8*) @get_f(i8* %this)
  musttail call void (i8*, ...) %fptr(i8* %this, ...)
  ret void
}

It has "f_thunk" which has llvm.va_start. Thus it saves incoming xmm registers into va_start area :

; LINUX-NEXT:    testb %al, %al
; LINUX-NEXT:    je .LBB0_2
; LINUX-NEXT:  # %bb.1:
; LINUX-NEXT:    movaps %xmm0, {{[0-9]+}}(%rsp)
; LINUX-NEXT:    movaps %xmm1, {{[0-9]+}}(%rsp)
; LINUX-NEXT:    movaps %xmm2, {{[0-9]+}}(%rsp)
; LINUX-NEXT:    movaps %xmm3, {{[0-9]+}}(%rsp)
; LINUX-NEXT:    movaps %xmm4, {{[0-9]+}}(%rsp)
; LINUX-NEXT:    movaps %xmm5, {{[0-9]+}}(%rsp)
; LINUX-NEXT:    movaps %xmm6, {{[0-9]+}}(%rsp)
; LINUX-NEXT:    movaps %xmm7, {{[0-9]+}}(%rsp)
; LINUX-NEXT:  .LBB0_2:

if f_thunk would be called without setting AL it would work incorrectly.

  1. calling "void (int, ...)" function without setting AL is AMD64 ABI violation.
  1. https://llvm.org/docs/LangRef.html#id325

    The caller and callee prototypes must match. Pointer types of parameters or return types may differ in pointee type, but not in address space. The calling conventions of the caller and callee must match. All ABI-impacting function attributes, such as sret, byval, inreg, returned, and inalloca, must match.

    These rules clearly states that thunk signature and callee signature should agree on calling conventions. if thunk assumes that AL should be set then it is an error to not setting it.

Probably we could fix implementation of perfectly forwarding thunks in an ABI compatible way ?

avl added a comment.Jan 20 2020, 3:55 AM

@rnk Reid, I am trying to research the question of using varargs thunk for non-varargs methods.
(like using "void (int, ...)" thunk with "void (int x, v4f32 v)" signature as from the previous discussion).
I found following place which looks like precisely the case which uses musttail for not only varargs case:

CGVTables.cpp:EmitCallAndReturnForThunk():

// If perfect forwarding is required a variadic method, a method using
// inalloca, or an unprototyped thunk, use musttail. Emit an error if this
// thunk requires a return adjustment, since that is impossible with musttail.
if (CurFnInfo->usesInAlloca() || CurFnInfo->isVariadic() || IsUnprototyped) {

In cited fragments "CurFnInfo->isVariadic()" is a common usage for musttail.
"CurFnInfo->usesInAlloca()" and "IsUnprototyped" looks like extended usages.
Both of these features are specific for MS ABI:

  1. InAlloca
https://llvm.org/docs/InAlloca.html

Primarily, this feature is required for compatibility with the Microsoft C++ ABI.
  1. Unprototyped thunk
CGVTables.cpp:EmitCallAndReturnForThunk()
// Arrange a function prototype appropriate for a function definition. In some
// cases in the MS ABI, we may need to build an unprototyped musttail thunk.
const CGFunctionInfo &FnInfo =
    IsUnprototyped ? CGM.getTypes().arrangeUnprototypedMustTailThunk(MD)
                   : CGM.getTypes().arrangeGlobalDeclaration(GD);
llvm::FunctionType *ThunkFnTy = CGM.getTypes().GetFunctionType(FnInfo);

Thus it looks like the case using musttail for non-varargs functions never happens with AMD64 ABI.
So It is safe for this patch(Avoid spilling xmm vararg arguments) to rely on ABI requirement, and AL register should always be properly set for vararg functions.

The ability to avoid floating point usages is essential for environments not using floating point. If thunks would be considered as "explicit FP use" then many simple C++ programs could not be compiled for non-floating point environment because of unintended FP usage.

avl updated this revision to Diff 242407.Feb 4 2020, 12:54 PM

addressed comments:

Do not create guarded registers for functions with "thunk" attribute.
Resolve style issues.
Check for MachineInstr::FrameDestroy instead of checking instruction patterns.

Unit tests: fail. 62440 tests passed, 1 failed and 845 were skipped.

failed: libc++.std/thread/thread_mutex/thread_mutex_requirements/thread_mutex_requirements_mutex/thread_mutex_class/try_lock.pass.cpp

clang-tidy: fail. clang-tidy found 0 errors and 2 warnings. 0 of them are added as review comments below (why?).

clang-format: fail. Please format your changes with clang-format by running git-clang-format HEAD^ or applying this patch.

Build artifacts: diff.json, clang-tidy.txt, clang-format.patch, CMakeCache.txt, console-log.txt, test-results.xml

Pre-merge checks is in beta. Report issue. Please join beta or enable it for your project.

avl added a comment.Feb 4 2020, 1:29 PM

clang-format recommendations: I was asked to not apply them in this review.
clang-tidy - would apply.
unit-test - looks like not reproduced, though I would check additionally.

avl updated this revision to Diff 243463.Feb 9 2020, 1:27 PM

applied clang-tidy reccomendations, removed icall_branch_funnel workaround.

avl added a comment.Feb 9 2020, 10:05 PM

Unit tests pass. 62573 tests passed, 0 failed and 842 were skipped.

clang-tidy pass.

clang-format fail.

rnk added inline comments.Feb 10 2020, 5:15 PM
llvm/lib/Target/X86/X86ISelLowering.cpp
3711

I haven't reviewed all of this code, but we have to find some way to refactor LowerCall. It was already poorly factored and long, but this is just too much complexity, too many lines of code. We need to find some way to separate concerns.

llvm/test/CodeGen/X86/musttail-varargs.ll
299–300

This seems unfortunate. :( Does all this go away if you put the "thunk" attribute on it? Everything in this file is meant to be a test for universal thunks, so adding the attribute is reasonable.

avl marked 2 inline comments as done.Feb 11 2020, 2:40 AM
avl added inline comments.
llvm/lib/Target/X86/X86ISelLowering.cpp
3711

Ok, I would refactor it. Would it be OK if that refactoring would be a part of this patch ? Or Do I need to make it previously in separate patch ?

llvm/test/CodeGen/X86/musttail-varargs.ll
299–300

yes. all this xmm save/restore code would go away if "thunk" is specified. I will add it to the test case.

additionally, I would like to make separate patch which would NOT do this xmm store/restore if noimplicitfloat=false. So that store/restore code is generated for only noimplicitfloat=true case.

thus, I assume following patches would be done:

  1. this patch. xmm stores/restores through phys regs would be generated for all _usual|_ thunks(not including universal thunks)
  2. patch which will fix ABI breakage for noimplicitfloat case - D62639
  3. do not generate xmm store/restore through phys regs for noimplicitfloat=false case.
avl planned changes to this revision.Oct 22 2020, 3:00 AM
avl abandoned this revision.Mar 14 2023, 8:09 AM

already fixed by D80163.

Herald added a project: Restricted Project. · View Herald TranscriptMar 14 2023, 8:09 AM