This is an archive of the discontinued LLVM Phabricator instance.

[CUDA][HIP] Allow comdat for kernels
ClosedPublic

Authored by yaxunl on Oct 25 2021, 2:31 PM.

Details

Summary

Two identical instantiations of a template function can be emitted by two TU's
with linkonce_odr linkage without causing duplicate symbols in linker. MSVC
also requires these symbols be in comdat sections. Linux does not require
the symbols in comdat sections to be merged by linker but by default
clang puts them in comdat sections.

If a template kernel is instantiated identically in two TU's. MSVC requires
that them to be in comdat sections, otherwise MSVC linker will diagnose them as
duplicate symbols. However, currently clang does not put instantiated template
kernels in comdat sections, which causes link error for MSVC.

This patch allows putting instantiated template kernels into comdat sections.

Diff Detail

Event Timeline

yaxunl requested review of this revision.Oct 25 2021, 2:31 PM
yaxunl created this revision.
tra added a comment.EditedNov 1 2021, 1:55 PM

As phrased, the summary would likely be rather confusing for anyone other than you and me.

Currently Visual Studio 2019 has a linker issue which causes linking error
when a template kernel is instantiated in different compilation units.

It's not clear what exactly is the issue and what causes it.

On the other hand, it is unnecessary to prefix kernel stub for MSVC
target since the host and device compilation uses different mangling
ABI.

This could use more details on why different mangling matters here. IIRC, on Linux where both host and device use the same mangling and HIP needed a way to tell apart the GPU-side kernels and their host-side stub. Different mangling makes it a non-issue.

This patch let clang not emit kernel handle for MSVC target to work around the linker issue.

Again, without the back-story the jump from linking error to mangling differences to "let's not emit a handle" does not make much sense.

I'd restructure it along the line of:

  • we emit host-side handles to match GPU-side kernels
  • the handles cause linking issues on windows because of X/Y/Z.
  • handles are not necessary on Windows, because of the different host/device mangling
  • not generating the handles avoids the linking issue on Windows.

This prompts the question -- should/could handle generation be improved instead? Having identical behavior on all platforms would arguably be better than a platform-specific workaround.

yaxunl added a comment.Nov 4 2021, 8:09 AM

As phrased, the summary would likely be rather confusing for anyone other than you and me.

Currently Visual Studio 2019 has a linker issue which causes linking error
when a template kernel is instantiated in different compilation units.

It's not clear what exactly is the issue and what causes it.

Currently, the identical instantiation of a template kernel in different TU causes linking error about duplicate symbols on Windows.

In the beginning, I thought it was due to a bug in MSVC linker. However, further investigation shows that it is not a MSVC linker bug, but a bug of clang.

Basically, it is not sufficient to set linkonce_odr linkage to let MSVC linker merge symbols. The symbol needs to be in comdat sections. This is not required on Linux. However, it would be more consistent to always put kernel stubs and kernel handles with linkonce_odr linkage into comdat if the target supports it.

I will update this patch to fix the comdat for kernel stub and kernel handle.

On the other hand, it is unnecessary to prefix kernel stub for MSVC
target since the host and device compilation uses different mangling
ABI.

This could use more details on why different mangling matters here. IIRC, on Linux where both host and device use the same mangling and HIP needed a way to tell apart the GPU-side kernels and their host-side stub. Different mangling makes it a non-issue.

This patch let clang not emit kernel handle for MSVC target to work around the linker issue.

Again, without the back-story the jump from linking error to mangling differences to "let's not emit a handle" does not make much sense.

I'd restructure it along the line of:

  • we emit host-side handles to match GPU-side kernels
  • the handles cause linking issues on windows because of X/Y/Z.
  • handles are not necessary on Windows, because of the different host/device mangling
  • not generating the handles avoids the linking issue on Windows.

This prompts the question -- should/could handle generation be improved instead? Having identical behavior on all platforms would arguably be better than a platform-specific workaround.

With the fix of comdat issue, we should be able to use a consistent kernel launching mechanism for Linux and Windows. Since the debugger requests kernel stub to have a different name than the kernel, we have to let the kernel stub and kernel handle have different names. That mechanism will stay unchanged.

I also found that MSVC name mangling currently does not add device_stub prefix to the mangled name of kernel stubs. I will update this patch to fix that.

With these changes, we should have consistent name mangling for kernel stubs and kernel launching mechanism on Linux and Windows.

tra added a comment.Nov 4 2021, 10:12 AM

With these changes, we should have consistent name mangling for kernel stubs and kernel launching mechanism on Linux and Windows.

Nice! Thank you for figuring out the root causes.

yaxunl updated this revision to Diff 385812.Nov 9 2021, 7:54 AM
yaxunl retitled this revision from [HIP] Do not use kernel handle for MSVC target to [CUDA][HIP] Allow comdat for kernels.
yaxunl edited the summary of this revision. (Show Details)

fix comdat instead

tra added subscribers: kpyzhov, rnk.Nov 9 2021, 11:03 AM
tra added inline comments.
clang/lib/CodeGen/CodeGenModule.cpp
4290–4293

This was added in D63277 specifically to deal with comdat-related issue on windows.

We do need to have unique addresses for each kernel stub. Placing stubs into comdat may allow them to be merged into one and that would be a problem.
https://docs.microsoft.com/en-us/cpp/build/reference/opt-optimizations?view=vs-2019

@rnk,@kpyzhov -- how do we make sure that identical functions are placed in comdat (needed for templates) but do not get comdat-folded with /OPT:ICF?

rnk added a subscriber: zequanwu.Nov 9 2021, 11:31 AM
rnk added inline comments.
clang/lib/CodeGen/CodeGenModule.cpp
4290–4293

These are readonly global variables. I believe MSVC's ICF implementation changed behavior here a few times over the years. I think the current state is that they don't merge such globals. If you are using LLD, you can use safe ICF to avoid merging such globals.

@zequanwu reviewed the state of ICF in LLD in the last year, so he may know more about the current status, and he may be more helpful if you have more questions.

Finally, you could mark these globals as mutable if you really want to block ICF. That will work reliably with any linker.

rnk added inline comments.Nov 9 2021, 11:34 AM
clang/lib/CodeGen/CodeGenModule.cpp
4290–4293

Broadly, I think this change is correct. If I had reviewed D63277, I probably would have objected to it and not approved it.

yaxunl added a comment.Nov 9 2021, 2:19 PM

I think probably it is necessary to merge linkonce_odr symbols for them to work properly.

Consider the following testcase:

// a.cu
template<typename T>
__global__ void foo(T x) {}

void test1() {
    foo<<<1,1>>>(1);
}

// b.cu
template<typename T>
__global__ void foo(T x) {}

void test2() {
    foo<<<1,1>>>(1);
}

// c.cu
template<typename T>
__global__ void foo(T x);

int main() {
    foo<<<1,1>>>(1);
}

Assume a.cu, b.cu, and c.cu are compiled with default -fno-gpu-rdc option and linked together.

Both a.obj and b.obj contain a global symbol foo<int> as the kernel stub function. c.obj contains reference to foo<int>, so it has to resolve to foo<int> in a.obj or b.obj. It only makes sense for linker to merge foo<int> in a.obj and b.obj and let c.obj resolve to the merged symbol. This also requires that the fat binary embedded in a.obj and b.obj must contain the identical definition of kernel foo<int>. That is, if ODR is followed, even though there are two fat binaries containing kernel foo<int>, only one of them will be used (it is fine since they are identical), which corresponds to the merged symbol for the kernel stub foo<int>.

The implication is that, we have to ask users to follow ODR even with the default -fno-gpu-rdc option. And users cannot have different definitions for the same template instantiation (e.g. foo<int>) in different TU's, otherwise there will be UB.

Considering ODR is a fundamental assumption for C++, I think it is justifiable to request users to follow that no matter whether -fgpu-rdc or -fno-gpu-rdc.

tra added a comment.Nov 9 2021, 2:45 PM

Yes, we do need to merge identical functions with identical names for templates.

The comdat-folding issue is different. IIUIC, it allows merging two functions with identical code and different names, into one function with two names. That will break CUDA as we do need to have each stub to have a unique address as we use it to find the matching GPU-side kernel.

rnk added a comment.Nov 9 2021, 3:27 PM

Yes, we do need to merge identical functions with identical names for templates.

The comdat-folding issue is different. IIUIC, it allows merging two functions with identical code and different names, into one function with two names. That will break CUDA as we do need to have each stub to have a unique address as we use it to find the matching GPU-side kernel.

Well, yes, ICF breaks function pointer identity. There's no way around that, and it is documented:
https://docs.microsoft.com/en-us/cpp/build/reference/opt-optimizations?view=msvc-170
CUDA users will have to remove /OPT:ICF from their linker flags.

Maybe you could make this work by embedding an ICF-breaking device into all the stubs. Something like a volatile asm blob that takes the current function as an argument and puts it in a register.

yaxunl added inline comments.Nov 10 2021, 7:05 AM
clang/lib/CodeGen/CodeGenModule.cpp
4290–4293

These symbols are for kernel stub functions. How do I mark them as mutable?

I did an experiment regarding the ICF issue and it seems not to affect kernel stub.

#include "hip/hip_runtime.h"

template<typename T>
void bar(T x) { }

template<typename T>
__global__ void foo(T x) {}

int main() {
  foo<<<1,1>>>(1);
  printf("%p\n", foo<int>);
  printf("%p\n", foo<float>);
  printf("%p\n", bar<int>);
  printf("%p\n", bar<float>);
}

If I pass -Wl,/opt:noicf, I got

00007FF622A01100
00007FF622A01170
00007FF622A01360
00007FF622A01370

By default, I got

00007FF693521100
00007FF693521170
00007FF693521360
00007FF693521360

This indicates bar<int> and bar<float> are folded but kernel stubs are not folded.

I also tried -Wl,/opt:icf=10, and kernel stubs are still not folded.

For HIP, since the kernel stub passes a unique kernel symbol to the internal kernel launching API, you may think the kernel stubs are not folded because they are not identical.

To imitate the CUDA case, where the address of kernel stub function itself is passed to the internal kernel launching API, I used the original patch of this review, where the kernel stub function passes the address of itself to the internal kernel launching API, therefore in a sense, the kernel stubs are all the same. Still, the kernel stubs are not folded.

Looking at the assembly of the kernel stub function:

; foo<int>
.seh_proc "??$foo@H@@YAXH@Z"
# %bb.0:
        pushq   %rsi
        .seh_pushreg %rsi
        pushq   %rdi
        .seh_pushreg %rdi
        subq    $120, %rsp
        .seh_stackalloc 120
        .seh_endprologue
        movl    %ecx, 60(%rsp)
        leaq    60(%rsp), %rax
        movq    %rax, 64(%rsp)
        leaq    104(%rsp), %rsi
        leaq    88(%rsp), %rdi
        leaq    80(%rsp), %r8
        leaq    72(%rsp), %r9
        movq    %rsi, %rcx
        movq    %rdi, %rdx
        callq   __hipPopCallConfiguration
        movq    80(%rsp), %rax
        movq    72(%rsp), %rcx
        movq    %rcx, 40(%rsp)
        movq    %rax, 32(%rsp)
        leaq    "??$foo@H@@YAXH@Z"(%rip), %rcx
        leaq    64(%rsp), %r9
        movq    %rsi, %rdx
        movq    %rdi, %r8
        callq   hipLaunchKernel
        nop
        addq    $120, %rsp
        popq    %rdi
        popq    %rsi
        retq
        .seh_endproc

; foo<float>
.seh_proc "??$foo@M@@YAXM@Z"
# %bb.0:
        pushq   %rsi
        .seh_pushreg %rsi
        pushq   %rdi
        .seh_pushreg %rdi
        subq    $120, %rsp
        .seh_stackalloc 120
        .seh_endprologue
        movss   %xmm0, 60(%rsp)
        leaq    60(%rsp), %rax
        movq    %rax, 64(%rsp)
        leaq    104(%rsp), %rsi
        leaq    88(%rsp), %rdi
        leaq    80(%rsp), %r8
        leaq    72(%rsp), %r9
        movq    %rsi, %rcx
        movq    %rdi, %rdx
        callq   __hipPopCallConfiguration
        movq    80(%rsp), %rax
        movq    72(%rsp), %rcx
        movq    %rcx, 40(%rsp)
        movq    %rax, 32(%rsp)
        leaq    "??$foo@M@@YAXM@Z"(%rip), %rcx
        leaq    64(%rsp), %r9
        movq    %rsi, %rdx
        movq    %rdi, %r8
        callq   hipLaunchKernel
        nop
        addq    $120, %rsp
        popq    %rdi
        popq    %rsi
        retq
        .seh_endproc

I think they are not folded because link.exe is smart enough to treat them as not identical comdat functions. I think we may stop worrying about the ICF foading kernel stubs.

rnk added a comment.Nov 10 2021, 10:25 AM

I think the key is the self-reference in the LEA instruction:

; foo<int>
.seh_proc "??$foo@H@@YAXH@Z"
...
        leaq    "??$foo@H@@YAXH@Z"(%rip), %rcx
...
; foo<float>
.seh_proc "??$foo@M@@YAXM@Z"
...
        leaq    "??$foo@M@@YAXM@Z"(%rip), %rcx

I think they are not folded because link.exe is smart enough to treat them as not identical comdat functions. I think we may stop worrying about the ICF foading kernel stubs.

It sounds like the behavior may have changed since D63277 was landed, and maybe we don't need it anymore.

tra accepted this revision.Nov 10 2021, 10:48 AM
This revision is now accepted and ready to land.Nov 10 2021, 10:48 AM
This revision was landed with ongoing or failed builds.Nov 10 2021, 1:43 PM
This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project. · View Herald TranscriptNov 10 2021, 1:43 PM