This is an archive of the discontinued LLVM Phabricator instance.

[LLD][COFF] Fix dll import for thread_local storage
AbandonedPublic

Authored by zero9178 on Jan 27 2020, 2:37 PM.

Details

Reviewers
rnk
mstorsjo
Summary

DLLs with thread local storage use the relocaction type IMAGE_REL_AMD64_SECREL which have no bitsize specified leading to a linking failure with lld. This adds the bitsize from ld.bfd https://github.com/bminor/binutils-gdb/blob/3024a17ae029ec7f55b498e99ddd6238e22fe565/bfd/coff-x86_64.c#L340

Would need someone to commit this for me if accepted

Diff Detail

Event Timeline

zero9178 created this revision.Jan 27 2020, 2:37 PM
Herald added a project: Restricted Project. · View Herald TranscriptJan 27 2020, 2:37 PM
zero9178 edited the summary of this revision. (Show Details)Jan 27 2020, 3:33 PM

First off, in general, this is lacking a test, and when uploading diffs, please include context (git diff -U9999) to make it more readable here within Phabricator.

However, I'm fairly sure this isn't the right thing to do and won't work as you hope it would. What the missing context would show is that this is from getRuntimePseudoRelocSize. The pseudo relocs are used for rewriting either absolute or relative addresses, to point to the actual location of the variable in a different DLL.

However, IMAGE_REL_AMD64_SECREL is used for getting the offset of a symbol within the enclosing section. For TLS variables, there's a single instance of the variable within the process, but every thread has a separate allocation of all TLS variables, and IMAGE_REL_AMD64_SECREL is used for getting the offset of the variable within the current thread's TLS allocation. Adding a relative offset to a different DLL's address space certainly won't work.

Also to respond to the summary:

DLLs with thread local storage use the relocaction type IMAGE_REL_AMD64_SECREL

This is used for TLS in general (and debug info), whether it's used in DLLs or not.

which have no bitsize specified leading to a linking failure with lld. This adds the bitsize from ld.bfd

The IMAGE_REL_AMD64_SECREL relocation is well supported in general, see https://github.com/llvm/llvm-project/blob/master/lld/COFF/Chunks.cpp#L113 and https://github.com/llvm/llvm-project/blob/master/lld/COFF/Chunks.cpp#L78-L88. So the fact that this relocation touches 32 bit in the image is well known by lld.

But it is not in the list of relocations in getRuntimePseudoRelocSize, as only the relocations that you sensibly can fix up that way is listed there.

In general, I would not expect raw access to TLS variables across DLL borders to work - at least not with the native TLS mechanism that clang uses by default (GCC uses emulated tls, I'm not familiar with how that works). But I would presume that use of TLS variables in such cases would produce a call to the TLS wrapper function instead, which should be able to successfully provide access to the variable from another DLL.

So the thing you're trying to fix needs a minimized reproducible testcase (for looking into the whole situation, not a LLD test just for testing that linking succeeds), to be able to reason about how you came into this situation, and whether the suggested fix really works on the whole.

Regarding the DLL boundary:
Clangs native approach uses the .tls sections of PE, just like MSVC. Microsofts documentation here: https://docs.microsoft.com/en-us/previous-versions/6yh4a9k1%28v%3dvs.140%29 States that it should work over dll boundaries if the dll is auto imported by the executable and work with a LoadLibrary call if it's Windows Vista or newer.

I will further investigate your other issues mentioned. I first encountered the issue when trying to build current trunk version of LLVM using BUILD_SHARED_LIBS on.

Regarding the DLL boundary:
Clangs native approach uses the .tls sections of PE, just like MSVC.

Yes

Microsofts documentation here: https://docs.microsoft.com/en-us/previous-versions/6yh4a9k1%28v%3dvs.140%29 States that it should work over dll boundaries if the dll is auto imported by the executable and work with a LoadLibrary call if it's Windows Vista or newer.

No, that's definitely not what it says. It says that TLS variables in DLLs will work (have space allocated for them etc) under these circumstances, which works when accessed from within the DLL itself.

The code that you're patching, pseudo relocations, is a mingw specific hack for accessing data members from another DLL without explicit dllimport directives - MSVC itself doesn't support that mechanism at all.

I will further investigate your other issues mentioned. I first encountered the issue when trying to build current trunk version of LLVM using BUILD_SHARED_LIBS on.

Ok, I feared that. FWIW, I haven't really gotten BUILD_SHARED_LIBS to work properly on windows at all.

I made a quick example to test out the situation you're trying to fix:

$ cat lib.h
#ifndef LIB_H
#define LIB_H
 
#ifdef _MSC_VER
extern __declspec(thread) int tlsvar;
#else
extern __thread int tlsvar;
#endif
 
void libfunc(void);
 
#endif
$ cat lib.c
#include "lib.h"
#include <stdio.h>
#include <windows.h>
 
#ifdef _MSC_VER
__declspec(thread) int tlsvar = 1;
#else
__thread int tlsvar = 1;
#endif
 
void libfunc(void) {
        printf("thread %d, address of tlsvar %p, value %d\n", GetCurrentThreadId(), &tlsvar, tlsvar);
}
$ cat main.c
#include "lib.h"
#include <stdio.h>
#include <windows.h>
#include <process.h>

static unsigned __stdcall threadfunc(void *arg) {
        tlsvar = 20;
        libfunc();
        return 0;
}
 
int main(int argc, char **argv) {
        HANDLE thread;
        tlsvar = 10;
        libfunc();
        WaitForSingleObject((HANDLE)_beginthreadex(NULL, 0, threadfunc, NULL, 0, NULL), INFINITE);
        libfunc();
        return 0;
}

If I build this and link it together and run it into one single executable, it works just fine:

$ cl -nologo main.c lib.c -Femsvc.exe
main.c
lib.c
Generating Code...
$ wine msvc.exe
thread 430, address of tlsvar 0011080C, value 10
thread 377, address of tlsvar 00114E0C, value 20
thread 430, address of tlsvar 0011080C, value 10
$ i686-w64-mingw32-gcc main.c lib.c -o gcc.exe
$ wine gcc.exe
thread 478, address of tlsvar 0023148C, value 10
thread 316, address of tlsvar 00231534, value 20
thread 478, address of tlsvar 0023148C, value 10

Now if I force the two object files into separate modules, it no longer works:

$ i686-w64-mingw32-gcc lib.c -shared -o lib.dll -Wl,--out-implib,lib.lib 
$ i686-w64-mingw32-gcc main.c -o gcc-shared.exe lib.lib
$ cp /usr/lib/gcc/i686-w64-mingw32/7.3-win32/libgcc_s_sjlj-1.dll .
$ wine gcc-shared.exe 
thread 463, address of tlsvar 00231584, value 1
thread 360, address of tlsvar 002316D4, value 1
thread 463, address of tlsvar 00231584, value 1

(It doesn't work, as the write to tlsvar in main.c isn't visible when read from libfunc in lib.c.)

If I try the same with clang+lld (patched with this patch and the same for i386):

$ i686-w64-mingw32-clang lib.c -shared -o lib.dll -Wl,--out-implib,lib.lib  
$ i686-w64-mingw32-clang main.c -o clang-shared.exe lib.lib
$ wine clang-shared.exe
wine: Unhandled page fault on write access to 0x0fd13c9c at address 0x40144d (thread 006e), starting debugger... 
018e:err:winediag:nodrv_CreateWindow Application tried to create a window, but no driver could be loaded.
018e:err:winediag:nodrv_CreateWindow Make sure that your X server is running and that $DISPLAY is set correctly. 
Unhandled exception: page fault on write access to 0x0fd13c9c in 32-bit code (0x0040144d).
winedbg: Internal crash at 0x7fa87e33

So, just as predicted, the SECREL which was supposed to be a plain section offset, was made into a full relative offset to the variable in the other DLL, which causes an enormous overflow of the allocated TLS block and thus crashes.

If I try to do the same with MSVC, it errors out as automatic dllimport of variables isn't supported there:

$ cl -nologo lib.c -link -dll -out:lib.dll -implib:lib.lib
lib.c
$ cl -nologo main.c lib.lib -Femsvc-shared.exe
main.c
main.obj : error LNK2019: unresolved external symbol _tlsvar referenced in function _threadfunc@4
msvc-shared.exe : fatal error LNK1120: 1 unresolved externals

If I instead add __declspec(dllimport) in lib.h when compiling main.c, I get this other very telling error message:

$ cl -nologo main.c lib.lib -Femsvc-shared.exe
main.c
lib.h(6): error C2492: 'tlsvar': data with thread storage duration may not have dll interface

Thank you for your very detailed explanation. Do you think it'd be worth updating documentation and cmake then to error if threading is enabled and build shared libs chosen? Either way I think this can be closed. Thanks again!

zero9178 abandoned this revision.Jan 28 2020, 1:16 PM

Thank you for your very detailed explanation. Do you think it'd be worth updating documentation and cmake then to error if threading is enabled and build shared libs chosen?

Hmm, maybe, not sure - not sure what the situation is with e.g. MSVC and BUILD_SHARED_LIBS either; I guess it isn't supported (as there's no mechanism for adding dllimport attibutes in all the internal interfaces where it would be necessary), but not sure if it errors out loudly or if it's just practically unsupported.

It has so far been kind of almost possible with mingw, thanks to the automatic import of data members from DLLs, but this issue (with the recent introduction of tls variables shared between object files) certainly makes it harder. And in addition to this, iirc there were some weird issues where both an import library and a static library of the same was linked, causing various weird issues. I haven't had time to look into that whole build setup lately - I just have some brief notes from trying, that there were issues.

FWIW, it seems like BUILD_SHARED_LIBS is explicitly disallowed in MSVC configs: https://github.com/llvm/llvm-project/blob/master/llvm/CMakeLists.txt#L555-L561

However people do seem to be building LLVM with this option for mingw configs: https://github.com/msys2/MINGW-packages/pull/6098 But do note that they're setting --allow-multiple-definition to avoid issues with multiple definitions (which was the issue I remember running into); I'm not entirely convinced setting that option is a good idea, but apparently it works well enough for them. If you have time to spare, looking into that and trying to avoid it properly would be appreciated.

So in general BUILD_SHARED_LIBS should work for mingw, at least it is not very far from working, so it's probably better to spend effort on making it work than to explicit forbid it.

Regarding the thread_local variable, I think the absolute safest thing to do would be to fix llvm/Support/TimeProfiler.h to only access the TimeTraceProfilerInstance variable via getter/setter functions (defined in lib/Support/TimeProfiler.cpp); then it should work fine across DLL boundaries and would avoid a lot of tricky TLS cases. Or instead of two getter/setter functions, just a wrapper function that returns a pointer. Then it could even be made transparent like this:

#ifdef _WIN32
// Actual LLVM_THREAD_LOCAL TimeTraceProfiler * variable is defined as static in TimeProfiler.cpp
TimeTraceProfiler **getTimeTraceProfilerInstancePtr();
#define TimeTraceProfilerInstance (*getTimeTraceProfilerInstancePtr())
#else
extern LLVM_THREAD_LOCAL TimeTraceProfiler *TimeTraceProfilerInstance;
#endif

(Not sure if such define trickery is appreciated, or if it's better to just rewrite the code in TimeProfiler.h to unconditionally use that wrapper function.)

If a wrapper function returning a pointer would work over dll boundaries then I think that would be the best way going forward. I will look into it! I have not encountered the multiple definitions error when building with shared libraries. I have encountered one when building current trunk as static libraries but I saw that you have fixed that for clang 10 already (was related to comdat and TLS)

If a wrapper function returning a pointer would work over dll boundaries then I think that would be the best way going forward.

Yes, that should work cleanly. In that case, there's no TLS access over the DLL boundary; you just have a normal function call to a function in another DLL, which then returns the address of the calling thread's copy of the TLS variable within that DLL.

I will look into it! I have not encountered the multiple definitions error when building with shared libraries.

I don't remeber exactly which component it was in, it could have been in some component which isn't always built (maybe in some unit test, but I don't think I would have built those in that configuration anyway...) If it works fine for you (after fixing the TLS variable) I could give it another shot to see where I found issues.

I have encountered one when building current trunk as static libraries but I saw that you have fixed that for clang 10 already (was related to comdat and TLS)

Yeah, the TLS variable exposed a bunch of issues relating to TLS access across object files, but that should be fixed now. But if it's accessed via a wrapper function, those bugs (that are fixed now) aren't hit anymore either, as it boils it down to the simplest case, a TLS variable that only is accessed within one single object file.

I will look into it! I have not encountered the multiple definitions error when building with shared libraries.

I don't remeber exactly which component it was in, it could have been in some component which isn't always built (maybe in some unit test, but I don't think I would have built those in that configuration anyway...) If it works fine for you (after fixing the TLS variable) I could give it another shot to see where I found issues.

I tested, and BUILD_SHARED_LIBS kind of worked, more or less, except for a few cases for which I've sent patches now.

The other thing I remembered was building with -DLLVM_BUILD_LLVM_DYLIB=ON -DLLVM_LINK_LLVM_DYLIB=ON; this fails pretty badly as it links both the big lib containing all of llvm, and the invidual libs, causing issues with duplication.