This is an archive of the discontinued LLVM Phabricator instance.

[WebAssembly] Add support for __init_array_start and friends
AbandonedPublic

Authored by sbc100 on Dec 1 2017, 5:21 PM.

Details

Reviewers
ruiu

Event Timeline

sbc100 created this revision.Dec 1 2017, 5:21 PM
ncw added a subscriber: ncw.EditedDec 2 2017, 1:53 AM

Nice approach - some things are bit shorter in your version.

A couple of things are a bit more general in mine I think, which uses some more boilerplate, but it emulates more accurately the linkage behaviour of native ELF.

  • The toolchain might define __init_array_start etc via crtbegin.o. By defining the symbol up-front, as a weak symbol, that does allow the translation units to provide the symbols if necessary.
  • You also need to make sure to provide __init_aray_start even if there isn't a .init_array segment (libc needs to know there are zero constructors to iterate over)
  • Is it worth pulling over my tests as well, which model real-world C++ usage

PS. I'm about half-way through getting COMDAT working, I'm not duplicating your work there am I? When COMDAT support is there, I think C++ minus exceptions will be roughly complete!

sbc100 added a comment.Dec 2 2017, 9:23 AM

I believe __init_aray_start is weakly referenced by libc.. so when its not defined it ends up being null, which means it will not iterate over any initializes. If this wasn't true we'd be seeing link failures with undefined symbols in the tests we are doing today.

sbc100 added a comment.Dec 2 2017, 9:28 AM

Regarding COMDAT, I was doing some research to see if we can get away with only implementing linkonce_odr and avoid implement the full semantics of COMDAT. This is how mach-o works today, but its not clear to me yet how significant the disadventages are.

I am actively working on (or at least thinking about) this stuff. Very happy for your contributions though! Lets just try to communicate more about the ongoing work. How about we open bugs for all these things we are working on. We can also sync up via email and IRC (I'm sbc100 on llvm and webassembly IRC).

ruiu added a subscriber: ruiu.Dec 4 2017, 10:23 PM
ruiu added inline comments.
wasm/SymbolTable.cpp
136

Does this function expect that a given name exists in the hash table? It looks like it does nothing if a given name collides with an existing name. I wonder if it is the right behavior.

If you think that a name given to this function is reserved and shouldn't be user-defined, it is better to emit an error message if there's an existing symbol in the symbol table, isn't it?

ncw added a comment.Dec 5 2017, 9:48 AM

I believe __init_aray_start is weakly referenced by libc.. so when its not defined it ends up being null, which means it will not iterate over any initializes. If this wasn't true we'd be seeing link failures with undefined symbols in the tests we are doing today.

Are you sure? I don't mean to contradict, but with a master build, I find that undefined weak symbols are set to (uint32_t)-1 when not found, and secondly that the output has a global import for __init_array_start if I link in anything from libc that uses it.

My test:

// File test.c
int main() { return 0; }

Commands:

clang -o /tmp/test.o -c /tmp/test.c
wasm-ld -L/home/ncw/workspace/llvm/compile-root/usr/lib /tmp/test.o /home/ncw/workspace/llvm/compile-root/usr/lib/crt1.o -lc -o /tmp/test --allow-undefined

Output contains:

(import "env" "__syscall_set_tid_address" (func $import$0 (param i32 i32) (result i32)))
(import "env" "__syscall_poll" (func $import$1 (param i32 i32) (result i32)))
(import "env" "__syscall_open" (func $import$2 (param i32 i32) (result i32)))
(import "env" "__syscall_exit_group" (func $import$3 (param i32 i32) (result i32)))
(import "env" "__syscall_exit" (func $import$4 (param i32 i32) (result i32)))
(import "env" "__init_array_start" (global $import$5 i32))
(import "env" "__init_array_end" (global $import$6 i32))
wasm/SymbolTable.cpp
136

It's not reserved, I think it's OK for library code to provide this symbol, if the libc implementation is using the crtbegin/crtend trick. I think we should just provide a built-in weak definition of the symbol that takes lowest priority.

I'm not sure though that it really matters what happens if the user tries to provide __init_array_start - as long as things work with musl anyone not using musl can just muddle through!

ncw added a comment.Dec 5 2017, 10:25 AM

Regarding COMDAT, I was doing some research to see if we can get away with only implementing linkonce_odr and avoid implement the full semantics of COMDAT. This is how mach-o works today, but its not clear to me yet how significant the disadventages are.

I am actively working on (or at least thinking about) this stuff. Very happy for your contributions though! Lets just try to communicate more about the ongoing work. How about we open bugs for all these things we are working on. We can also sync up via email and IRC (I'm sbc100 on llvm and webassembly IRC).

I've taken the plunge and had a go just doing COMDAT "properly". I had a look at linkonce_odr, but I wasn't sure it was going to save much time, since the hard work is all in LLD and consists of improvements to the actual function and data segment linkage. We'll have to put in that work regardless.

What I've done is open three more bugzilla bugs for discussion:

I've implemented a first-cut of the core COMDAT support, and the additional two bugs are non-functional, and simply reduce the binary size without affecting behaviour.

I'm also opening two work-in-progress Phabricator reviews, just to show where I've got to so far.

Please feel free to email ncw@realvnc.com or comment on the bugzilla tickets. I work from Cambridge (UK) and I think you're in GMT+8, so not much overlap, but I'll be on IRC when possible.

In D40760#945281, @ncw wrote:

I believe __init_aray_start is weakly referenced by libc.. so when its not defined it ends up being null, which means it will not iterate over any initializes. If this wasn't true we'd be seeing link failures with undefined symbols in the tests we are doing today.

Are you sure? I don't mean to contradict, but with a master build, I find that undefined weak symbols are set to (uint32_t)-1 when not found, and secondly that the output has a global import for __init_array_start if I link in anything from libc that uses it.

My test:

// File test.c
int main() { return 0; }

Commands:

clang -o /tmp/test.o -c /tmp/test.c
wasm-ld -L/home/ncw/workspace/llvm/compile-root/usr/lib /tmp/test.o /home/ncw/workspace/llvm/compile-root/usr/lib/crt1.o -lc -o /tmp/test --allow-undefined

Output contains:

(import "env" "__syscall_set_tid_address" (func $import$0 (param i32 i32) (result i32)))
(import "env" "__syscall_poll" (func $import$1 (param i32 i32) (result i32)))
(import "env" "__syscall_open" (func $import$2 (param i32 i32) (result i32)))
(import "env" "__syscall_exit_group" (func $import$3 (param i32 i32) (result i32)))
(import "env" "__syscall_exit" (func $import$4 (param i32 i32) (result i32)))
(import "env" "__init_array_start" (global $import$5 i32))
(import "env" "__init_array_end" (global $import$6 i32))

Hmm.. I can't see the weak reference either now that you mention it. But now I'm confused about why I've never seen undefined errors for those symbols. I'll have to investigate.

BTW, I don't see the same set of undefined symbols as you. Can you build without --allow-undefined and instead use -allow-undefined-file /path/to/wasm.syms (this is what the clang frontend currently does). Also, you must have different musl version to me because the one we use on the waterfall should only depend on env.__syscall3 etc (not the syscalls that include the name).

wasm/SymbolTable.cpp
136

Lets just mimic whatever the native/ELF toolchains do for this. If they allow user-provided __init_array_start then we should too and vice versa.

sbc100 added a comment.Dec 5 2017, 5:28 PM

Actually they are weakly referenced in musl:
https://github.com/jfbastien/musl/blob/wasm-prototype-1/src/env/__libc_start_main.c#L14

__attribute__((__weak__, __visibility__("hidden")))                              
extern void (*const __init_array_start)(void), (*const __init_array_end)(void);

But it looks to me like the ELF linker always adds them anyway, so maybe we should just do that.

ncw added a comment.Dec 6 2017, 8:00 AM

I had a go with this patch, to try and rebase it on top of the synthetic symbols changes you merged recently.

I found that quite a number of tests fail with this patch? In the changes above, you've made it so that the stack pointer symbol isn't given a global declaration anymore, which surely can't be right. Did this diff maybe come from splitting up a larger patch (eg splitting off chunks from D40859) and leaving out a few of the necessary chunks?

I don't think this is usable as-is, but, it's very helpful to see what direction you're trying to nudge things towards.

I decided the most helpful thing would be to rebase this patch, and combine it with my changes in D40614, and I've updated D40614 with the result. Once your global offset changes land (D40859) there'll be a merge needed, but I'm happy to do that in D40614.

ruiu accepted this revision.Dec 6 2017, 3:38 PM

LGTM

This revision is now accepted and ready to land.Dec 6 2017, 3:38 PM

For the record, I believe that it would be better to work towards implementing the proposal here:

https://github.com/WebAssembly/tool-conventions/issues/25

sbc100 added a comment.Dec 6 2017, 3:47 PM

I'm not quite ready to land this anyway. It needs rebasing and integrating with @ncw proposed version.

Regarding @sunfish's alternative solution, I am still on the fence about the benefits (specifically the security benefit of not have constructors in the indirect function table) vs the costs (having to teach the linker how to synthesize the constructor calling function, and deviating from ELF). I would also be open to landing this first and then switching.

ncw added a comment.Dec 7 2017, 9:59 AM

Dan's idea in D40759 is nice!

I listed a lot of "Cons" in D40614, partly because being ELF-like feels "safer", but overall the dedicated start-function-linkage does seem to be a better approach.

I guess there's a question of whether it's worth merging at all this .init_array support Sam (and I) did...? I've integrated Sam's work with mine and fixed the tests, so that could be merged for the time being. From my point of view, the .init_array stuff at least gets Musl working right away.

(Musl aside)

It looks like you guys are using Musl from here: https://github.com/jfbastien/musl. I tried using it, but some things seem to be missing - for some reason dynamic linking was apparently higher priority than working malloc/free!

I've had a fresh go at building a native Wasm Musl port, and I've actually been trying to submit those patches upstream. I've got a working malloc/free/mmap/munmap, and the syscalls are fully statically linked, and come out as imports on the Wasm module with names like "__syscall_open".

Here's the repo with my Musl, which builds and runs out-of-the-box with the .init_array patch: https://github.com/NWilson/musl/pull/1

I've also made a set of instructions for myself and some colleagues on how I'm building Musl at the moment:
https://gist.github.com/NWilson/b0d7a643b66bcd4c118b26d979572c5b

As and when Dan's @llvm.global_ctors change lands, I'll certainly be updating my Musl to match.

I've put in a bit of effort on Musl, it would be nice to package that in a way that's useful to you guys so that it gets used.

I think I'm going to abandon this, but the work in this patch and in Nicolas's patch will be probably still end up being generally useful.

Nicolas, regarding musl, I suggest we try to concentrate our efforts.
There is discussion here about moving the repo under WebAssmembly: https://github.com/jfbastien/musl/issues/15
And another one here about standardizing on the syscall API: https://github.com/WebAssembly/tool-conventions/issues/27

I'd be wary of upstreaming just yet, except for the most obvious changes. Certainly it would be long term goal. Have you had much success landing patches there? In the past I've heard the musl developers were not interested in ports to non-linux platforms.

I definitely like the idea of each syscall being its own import. What do you think about upstream your patches to https://github.com/jfbastien/musl?

sbc100 abandoned this revision.Dec 7 2017, 10:35 AM