This is an archive of the discontinued LLVM Phabricator instance.

[WebAssembly] Add support for init functions linking metadata
ClosedPublic

Authored by sbc100 on Dec 13 2017, 4:24 PM.

Details

Summary

This change lays the groundwork lowering of @llvm.global_ctors
and @llvm.global_dtors for the wasm object format. Some parts
of this patch are subset of: https://reviews.llvm.org/D40759

See https://github.com/WebAssembly/tool-conventions/issues/25

Event Timeline

sbc100 created this revision.Dec 13 2017, 4:24 PM
ncw accepted this revision.Dec 14 2017, 5:03 AM

Looks good.

At the back of my mind, I'm just wondering about how Comdat interacts with this. In the .init_array it's fairly simple, since the data section containing the function pointer is handled just like any other data section and is included/excluded as necessary. Is it ever possible for a global ctor to be registered in a COMDAT, maybe for static class members in template classes? In that case we'll need to remember to check that duplicate INIT_FUNCS are discarded (if indeed they can participate in Comdat discarding).

I think the right way to model that would be in the Comdat data. (This is all assuming we do Comdat groups...) Currently Comdat group data is a vector of typed function/data-segment indices. Now that INIT_FUNCS aren't in the .init_array data segment, we'll need to make the Comdat group to be a vector of type function/data-segment/INIT_FUNCS indices.

lib/Object/WasmObjectFile.cpp
675

You've changed the behaviour of parseStartSection, which used to take a function index excluding the imported functions.

Is that deliberate? These indexes are slippery... I would have thought that both parseStartSection and WASM_INIT_FUNCS should only be able to specify a defined function, ie the range should not include the imported functions.

This revision is now accepted and ready to land.Dec 14 2017, 5:03 AM
In D41208#955130, @ncw wrote:

Looks good.

At the back of my mind, I'm just wondering about how Comdat interacts with this. In the .init_array it's fairly simple, since the data section containing the function pointer is handled just like any other data section and is included/excluded as necessary. Is it ever possible for a global ctor to be registered in a COMDAT, maybe for static class members in template classes? In that case we'll need to remember to check that duplicate INIT_FUNCS are discarded (if indeed they can participate in Comdat discarding).

I think the right way to model that would be in the Comdat data. (This is all assuming we do Comdat groups...) Currently Comdat group data is a vector of typed function/data-segment indices. Now that INIT_FUNCS aren't in the .init_array data segment, we'll need to make the Comdat group to be a vector of type function/data-segment/INIT_FUNCS indices.

LLVM handles this in @llvm.global_ctors by making each entry consist of:

  • the constructor function address
  • a priority value
  • an "associated" symbol, which when non-null refers to a comdat group

So, another approach would be to do something similar, and add a comdat symbol to the WASM_START_FUNCS entries (see also here), which would reflect LLVM's structure pretty closely. I don't have a strong opinion about whether this is better than doing it within the comdat metadata section though.

lib/Object/WasmObjectFile.cpp
675

The underlying wasm start function takes a plain function index, which includes the imports, so I think it does make sense for low-level tools like this to use the same index space.

If there are tools where it's complex to support this, I think it's fine to issue an error if the index is an import, but using the same index space everywhere is one way to keep a semblance of a foothold in this slippery terrain.

ncw added a comment.Dec 14 2017, 6:31 AM

So, another approach would be to do something similar, and add a comdat symbol to the WASM_START_FUNCS entries (see also here), which would reflect LLVM's structure pretty closely. I don't have a strong opinion about whether this is better than doing it within the comdat metadata section though.

Either way works. My preference for putting things in the Comdat metadata section is simply to avoid having to write out the string name for the Comdat multiple times (which could be veeery long for STL templated methods!) - or alternatively avoids referring to Comdats by some "Comdat index" which is a bit more awkward because then you'd have to make the list of Comdats up-front and it's just (marginally) more work.

lib/Object/WasmObjectFile.cpp
675

That makes sense. I guess it should then have a branch added && Index >= NumImportedFunctions

In D41208#955217, @ncw wrote:

So, another approach would be to do something similar, and add a comdat symbol to the WASM_START_FUNCS entries (see also here), which would reflect LLVM's structure pretty closely. I don't have a strong opinion about whether this is better than doing it within the comdat metadata section though.

Either way works. My preference for putting things in the Comdat metadata section is simply to avoid having to write out the string name for the Comdat multiple times (which could be veeery long for STL templated methods!) - or alternatively avoids referring to Comdats by some "Comdat index" which is a bit more awkward because then you'd have to make the list of Comdats up-front and it's just (marginally) more work.

My preference would be always refer to symbols, rather than including strings in either the COMDAT section or the init funcs. So one can refer to a COMDAT by symbol. I think this is how ELF does it. We don't have a great notion of symbol index right now (because some symbols and functions and some are globals), but I think we need one. Dan proposed doing this for the SYMBOL_INFO metadata already and I think we should do that.

lib/Object/WasmObjectFile.cpp
675

This is basically a bug fix from Dan's branch. I think the new code is correct. Start function was always meant to be in the function index space and the check was simply wrong before.

ncw added a comment.Dec 14 2017, 9:11 AM

My preference would be always refer to symbols, rather than including strings in either the COMDAT section or the init funcs. So one can refer to a COMDAT by symbol. I think this is how ELF does it. We don't have a great notion of symbol index right now (because some symbols and functions and some are globals), but I think we need one. Dan proposed doing this for the SYMBOL_INFO metadata already and I think we should do that.

I think the Comdat metadata will should include a string name for the Comdat - although as an optimisation I suppose we could point to an index in the "names" metadata. The LLVM IR fully supports the notion of named Comdats, and I don't think we'd want to impose too much on the different language frontends, ie we want to be compatible with any legal IR that a frontend might produce, with lowering passes for anything we don't want to support. A Comdat could contain multiple symbols or chunks of data, and could have its own name - otherwise Haskell or Rust or who knows what will complain.

In ELF, the Comdat is named independently from the symbol(s) it contains. It's just that the Comdat's name and the symbol names are both indexes into the same table of names, so the string data itself isn't duplicated in the ELF binary. They are still named independently though, even though it's overwhelmingly common for the Comdat's name to match the name of an existing symbol.

In D41208#955465, @ncw wrote:

My preference would be always refer to symbols, rather than including strings in either the COMDAT section or the init funcs. So one can refer to a COMDAT by symbol. I think this is how ELF does it. We don't have a great notion of symbol index right now (because some symbols and functions and some are globals), but I think we need one. Dan proposed doing this for the SYMBOL_INFO metadata already and I think we should do that.

I think the Comdat metadata will should include a string name for the Comdat - although as an optimisation I suppose we could point to an index in the "names" metadata. The LLVM IR fully supports the notion of named Comdats, and I don't think we'd want to impose too much on the different language frontends, ie we want to be compatible with any legal IR that a frontend might produce, with lowering passes for anything we don't want to support. A Comdat could contain multiple symbols or chunks of data, and could have its own name - otherwise Haskell or Rust or who knows what will complain.

In ELF, the Comdat is named independently from the symbol(s) it contains. It's just that the Comdat's name and the symbol names are both indexes into the same table of names, so the string data itself isn't duplicated in the ELF binary. They are still named independently though, even though it's overwhelmingly common for the Comdat's name to match the name of an existing symbol.

I think maybe you misunderstood or I wasn't clear. I wasn't suggesting dropping named COMDATs. Just that we should use the symbol tables to store the strings, like ELF does. So the name of the COMDAT and then name of a function can be stored in the same string/symbol table. Basically I want to make symbol a more first class thing.

ncw added a comment.Dec 14 2017, 10:18 AM

Ah, that sounds good.

I wonder what will happen to Comdats anyway... I'm just waiting to see, as an outsider it's not really my place to prod any decisions! I'm still a bit bemused why there's so much resistance to Comdat groups - which are simple and model the C++ and LLVM IR semantics exactly. It just feels like what we're being asked to implement, with the SHF_LINK_ORDER thing, is a semi-untested idea, given that it's not actually how the ELF does things currently (it applies Comdat groups first). I'm also a bit uneasy about relying so heavily on GC, so that rather than just being a filesize optimisation it becomes a vital component to the linking model's semantics. Even if we did have GC, doing Comdat discarding first would be a reassuring way of getting the "right" semantics, so that GC can be tweaked without worrying so much about introducing bugs. Not up to me to decide though!

In D41208#955548, @ncw wrote:

Ah, that sounds good.

I wonder what will happen to Comdats anyway... I'm just waiting to see, as an outsider it's not really my place to prod any decisions! I'm still a bit bemused why there's so much resistance to Comdat groups - which are simple and model the C++ and LLVM IR semantics exactly. It just feels like what we're being asked to implement, with the SHF_LINK_ORDER thing, is a semi-untested idea, given that it's not actually how the ELF does things currently (it applies Comdat groups first). I'm also a bit uneasy about relying so heavily on GC, so that rather than just being a filesize optimisation it becomes a vital component to the linking model's semantics. Even if we did have GC, doing Comdat discarding first would be a reassuring way of getting the "right" semantics, so that GC can be tweaked without worrying so much about introducing bugs. Not up to me to decide though!

You opinion (as someone who has been doing a lot of the work in this area) certainly does matter here. I think its worth a little more consideration but ultimately if COMDAT support you have implemented does the job of supporting C++ needs and the alternative is more complicated then we have a good argument for landing it.

I would suggest that we look bit more at the COFF associative COMDATs (which is apparently similar, but more obviously designed to implement COMDAT). My understanding is that that technique basically still results in named groups, but that the groups are defined by a chain (each section depends on at most one other one, and each group points to the top of a chain).

This revision was automatically updated to reflect the committed changes.