Page MenuHomePhabricator

[WebAssembly] Add explicit symbol table
ClosedPublic

Authored by sbc100 on Feb 13 2018, 3:51 PM.

Details

Summary

This change modified lld to in response the llvm change which
moved to a more explicit symbol table in the object format.

Based on patches by Nicholas Wilson:

  1. https://reviews.llvm.org/D41955
  2. https://reviews.llvm.org/D42585

The primary difference that we see in the test output is that
for relocatable (-r) output we now have symbol table which
replaces exports/imports and globals.

See: https://github.com/WebAssembly/tool-conventions/issues/38

Diff Detail

Repository
rL LLVM

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
sbc100 added a comment.EditedFeb 15 2018, 5:15 PM

In the old code we used DefinedGlobal to meant global data in linear memory. In the new code that got renamed to DefinedData and a new type of entity called DefinedGlobal was added for wasm globals. Its a shame it has the same name in the new code, but I think it makes sense.

To make this even more complicated to follow, we previously modeled data symbols as wasm globals... i.e. we use the wasm globals in the object format to store the addresses in linear memory of data symbols. A little confusing, and one reason why we are making this change to have a more explicit symbol table.

We could simply store the wasm global data in the DefinedGlobal itself, rather then having it point to an InputGlobal. I think the problem there is when more than one symbol aliases the same global. But, we don't have the case yet so perhaps we can push that down the line a little. Let me see if I can simplify things.

ruiu added a comment.Feb 15 2018, 5:19 PM

We could simply store the wasm global data in the DefinedGlobal itself, rather then having it point to an InputGlobal. I think the problem there is when more than one symbol aliases the same global. But, we don't have the case yet so perhaps we can push that down the line a little. Let me see if I can simplify things.

What is your concern on aliasing? In general, symbols can have aliases (by just pointing to the same offset in the same section). We don't do anything special about it in other file formats, and it just works fine -- we don't need to find if a symbol has aliases or not to generate a correct output. I'm wondering if there's some wasm-specific concern about it.

ncw added a comment.Feb 15 2018, 5:38 PM

To make this even more complicated to follow, we previously modeled data symbols as wasm globals... i.e. we use the wasm globals in the object format to store the addresses in linear memory of data symbols. A little confusing, and one reason why we are making this change to have a more explicit

We could simply store the wasm global data in the DefinedGlobal itself, rather then having it point to an InputGlobal. I think the problem there is when more than one symbol aliases the same global. But, we don't have the case yet so perhaps we can push that down the line a little. Let me see if I can simplify things.

I'm not sure it would be simpler to put the definition (initialiser) for the symbol in the symbol itself. As you say, it's conceptually wrong because it doesn't work with aliases (unused in this case, but it demonstates the model wouldn't be right); secondly it breaks the link between defined symbols and their "chunk" containing the initializer; and thirdly, worst, prevents unifying the handling of globals/data (eg with comdat for globals, then symbols could then be initialized in the same way for all types).

I think we should aim to have globals work in the same way in LLD as data/function symbols, to keep the mental model simple, rather than arbitrarily constraing them with different resolution semantics (eg coding up a way of defining global symbols that doesn't allow aliases).

  • Globals variables basically consists of types and indices

And an initializer! That's what distinguishes a defined global from an imported/undefined one, in the same way that the initializer (bss or data section) does for a data symbol, or the function body does for a function symbol.

It looks more and more like global variables are just symbols with indices. Do you agree? I believe it can naturally be modeled as a symbol.

That makes me wonder what the current DefinedGlobal represents. Is this a symbol that lives in the linear memory or global variable?

DefinedGlobal is a symbol representing storage that's only accessible via its global index, rather than via a linear memory address. "Defined" means the Wasm file allocates that storage, rather than importing the storage from another DSO. We definitely need the distinction between defined and undefined for globals.

The key takeaway is that a global is a storage/memory slot that can be allocated by a binary, just like a data symbol in ELF, it's only the addressing model that's different (via index rather than RAM address).

Thanks for looking into this!

ruiu added a comment.Feb 15 2018, 5:46 PM

And an initializer! That's what distinguishes a defined global from an imported/undefined one, in the same way that the initializer (bss or data section) does for a data symbol, or the function body does for a function symbol.

An initializer can just be a member of a symbol, no?

Global variables look very similar to imported functions. There's no "body" of imported symbols, and therefore there's no chunk that represents imported function body, even though imported symbols are "defined" (as opposed to "undefined" symbols that have to be resolved until the end of linking). Symbols doesn't have to have chunks, and that's not actually odd.

ncw added a comment.Feb 15 2018, 5:46 PM

Aliasing works because of InputSection/InputChunk, which allows multiple symbols to reference the same chunk of memory, without having the section or body duplicated in the output. If you get rid of InputGlobal, you'll have aliasing issues, because the output will duplicate the global, causing it to be allocated storage twice - just like if a data section aliased by two symbols would chew up double the RAM if the symbols weren't able to share the InputSection. Plus the semantics would be wrong because writes to one alias should be visible by reads to the other.

ruiu added a comment.Feb 15 2018, 5:52 PM

Aliasing works because of InputSection/InputChunk, which allows multiple symbols to reference the same chunk of memory, without having the section or body duplicated in the output. If you get rid of InputGlobal, you'll have aliasing issues, because the output will duplicate the global, causing it to be allocated storage twice - just like if a data section aliased by two symbols would chew up double the RAM if the symbols weren't able to share the InputSection. Plus the semantics would be wrong because writes to one alias should be visible by reads to the other.

I don't think you need to use InputChunk to address the issue. You can just maintain the indices of globals the linker has already emitted so that it doesn't assign different indices to two Globals that originally had the same index, for example. There may be easier way of doing it than that.

ncw added a comment.Feb 15 2018, 5:56 PM

An initializer can just be a member of a symbol, no?

By that logic, you wouldn't need InputSection at if all in the ELF linker. You just put the body of each symbol on the symbols, rather than allowing multiple symbols to defence the same section.

Global variables look very similar to imported functions. There's no "body" of imported symbols, and therefore there's no chunk that represents imported function body, even though imported symbols are "defined" (as opposed to "undefined" symbols that have to be resolved until the end of linking). Symbols doesn't have to have chunks, and that's not actually odd.

Globals are really more like data symbols. An int variable would have a 4-byte section asking the ELF interpreter to allocate 4 bytes of ram. A wasm global I32 has its own section in the wasm file, asking for a 4byte register-backed region to be allocated.

Do we have a different understanding of "imported"? An imported function is definitely undefined from a wasm and a linking point of view - both in ELF and warm, the function symbols with no body are undefined and the linker is looking for a definition against which to reduce any calls.

ncw added a comment.Feb 15 2018, 6:05 PM

I don't think you need to use InputChunk to address the issue. You can just maintain the indices of globals the linker has already emitted so that it doesn't assign different indices to two Globals that originally had the same index, for example. There may be easier way of doing it than that.

That's true, I admit. But you haven't done that for functions (even though it would be equally possible to get rid of InputSection or InputFunction in the same way). I just thinks it really helps comprehension if all the symbol clases are implemented in the same way, rather than differing in ad-hoc ways. Wanting DefinedFunction to work like DefinedGlobal is similar to wanting the wasm Symbol class to match the ELF symbol class.

I don't want to argue though - I'll go with your preference as the code owners, now I've (hopefully) explained my reasoning.

ruiu added a comment.Feb 15 2018, 6:05 PM

By that logic, you wouldn't need InputSection at if all in the ELF linker. You just put the body of each symbol on the symbols, rather than allowing multiple symbols to defence the same section.

I don't know if I understand what you wrote correctly. In ELF, you need both symbols and sections in the linker. One section can have multiple symbols at its various offsets, and some of them are aliases (e.g. two symbols can point to the same location in the same section).

Globals are really more like data symbols. An int variable would have a 4-byte section asking the ELF interpreter to allocate 4 bytes of ram. A wasm global I32 has its own section in the wasm file, asking for a 4byte register-backed region to be allocated.

Well, that runtime allocates some memory at runtime doesn't necessarily mean that that has to be represented as Chunk. At least to me the use of Chunk in this patch to represent Globals is very odd and is not an intended use of Chunk. Chunk is designed to represent sections or something like that, and Globals is different.

Do we have a different understanding of "imported"? An imported function is definitely undefined from a wasm and a linking point of view - both in ELF and warm, the function symbols with no body are undefined and the linker is looking for a definition against which to reduce any calls.

By imported, I mean symbols in DSO (DLL) files. In ELF, it's SharedSymbol. In COFF, It's DLLImported symbol. They are considered "defined" because they are already resolved, but the actual code is not given until load-time.

ncw added a comment.Feb 15 2018, 6:14 PM

I see about "imported", thanks! I hadn't realized that terminology in DSOs, my mistake. In wasm "imported" really is undefined.

Maybe we could compromise on keeping InputGlobal, but renaming it and making it not derive from InputChunk. That would almost compile as-is in fact, and I think maybe you've right that would be cleaner. MarkLive would have to gain support for it, but that's only a few lines of code actually.

I could try that tomorrow, or Sam could try now?

In D43264#1009610, @ncw wrote:

I see about "imported", thanks! I hadn't realized that terminology in DSOs, my mistake. In wasm "imported" really is undefined.

Maybe we could compromise on keeping InputGlobal, but renaming it and making it not derive from InputChunk. That would almost compile as-is in fact, and I think maybe you've right that would be cleaner. MarkLive would have to gain support for it, but that's only a few lines of code actually.

I could try that tomorrow, or Sam could try now?

I've been trying to do this this evening but its fairly entangled. I will try again tomorrow, perhaps with a preparatory change to precede it.

ncw added a comment.Feb 16 2018, 1:53 PM

I've been trying to do this this evening but its fairly entangled. I will try again tomorrow, perhaps with a preparatory change to precede it.

Sorry Sam, somehow this didn't catch my eye in my inbox -I realise now we might have been working on the same cleanup :( Oh well, don't worry if you go a different direction to what I did in D43391.

I can't really chat this evening - not that I'd be much help anyway!

If you and Rui can come up with a solution that works for InputGlobal - go ahead and merge! I'll take any solution, I'm only arguing for changes at the moment that are fairly close to how these patches are currently, to so that we don't end up with another big refactor on our hands before we can merge what we've got so far.

Do you have any thoughts at this point on how you think InputGlobal should fit into the hierarchy, eg like the comment in D43391?

sbc100 added a comment.EditedFeb 16 2018, 2:42 PM

It also didn't feel right to have InputGlobal to inherit from InputChunk. For me the Chunks are really for the non-synthetic sections. That is the ones construct using memory copied from the input files + relocations.

I think of the global section as being a sythentic section (indeed it is right now). So I guess I agree with ruiu here. But don't worry, I'm sure we can find a solution...we will get this thing landed soon one way or another :)

sbc100 updated this revision to Diff 135159.Feb 20 2018, 4:04 PM
  • Don't use InputChunk to model globals
sbc100 updated this revision to Diff 135163.Feb 20 2018, 4:16 PM
  • revert parts
ncw accepted this revision.Feb 21 2018, 4:01 AM

It looks OK still. It's familiar, in that I tried a similar refactor (making InputGlobal not inherit InputChunk), and I can see you've faced the same problems.

I think it works, just some comments about consistency below.

wasm/MarkLive.cpp
45–47 ↗(On Diff #135163)

You're not marking globals as Live, so how will they avoid being GC'd? You need something like this to be added I think:

if (DefinedGlobal *G = dyn_cast<DefinedGlobal>(Sym)) {
  G->Global->Live = true;
  return;
}
InputChunk *Chunk = ...

It's probably just slipped under the radar because the stack-pointer is a synthetic global, and there aren't any tests yet for non-synthetic ones.

I have some tests I'm waiting to commit for more global support, I guess we could fix GC for defined globals then.

wasm/SymbolTable.cpp
147–148 ↗(On Diff #135163)

It's odd that this is the one case without an assert(WasInserted) - I've forgotten, is there a reason we can't assert here like we do in addSyntheticFunction/Global?

wasm/Symbols.cpp
49–51 ↗(On Diff #135163)

Missing specialisation for DefinedGlobal here; hasOutputIndex should definitely match the implementation for getOutputIndex below!

166–173 ↗(On Diff #135163)

Shouldn't this method go away too now? We got rid of it on master for FunctionSymbol, so GlobalSymbol should match that unless there's a good reason to use the old scheme still.

wasm/Writer.cpp
144 ↗(On Diff #135163)

Should this unique_ptr<InputGlobal> still be here? There are now two of these, one in the Linker and one here. (Personally, I think it makes sense to own the stack-pointer here and get rid of the InputGlobal in Linker, so that CtorFunction and StackPointer behave in the same way; I can't immediately think of a reason why they shouldn't...)

721–723 ↗(On Diff #135163)

Also needs a special branch like in MarkLive:

if (DefinedGlobal* G = dyn_cast<DefinedGlobal>(Sym))
  if (G->Global && !G->Global->Live)
    continue;
}
InputChunk *Chunk = Sym->getChunk();
if (Chunk && !Chunk->Live)
  continue;

I know it's tedious... but if you're going to have the Live flag on InputGlobal (which I agree we do want) then we have to check it separately everywhere that we currently check the InputChunk::Live flag. That's the nuisancesome consequence of splitting InputGlobal off from InputChunk, but we can stomach that cost for now.

804–809 ↗(On Diff #135163)

?? Shouldn't this be done in MarkLive; why should globals be different to functions in this regard?

sbc100 updated this revision to Diff 135275.Feb 21 2018, 9:26 AM
sbc100 marked 10 inline comments as done.
  • feedback
wasm/MarkLive.cpp
45–47 ↗(On Diff #135163)

Globals (like Types) are always generated precisely regardless of --gc-sections, and so are not handled here.

wasm/SymbolTable.cpp
147–148 ↗(On Diff #135163)

Oops, that must have been merge conflict. Restored.

wasm/Writer.cpp
721–723 ↗(On Diff #135163)

I think we will want an isLive() method on Symbol, like the COFF linker has. I will add that before landing this change.

804–809 ↗(On Diff #135163)

When I added GC I decided to make it only apply to functions and data (chunks). For synthetic sections like the Type section, and now the Global section, I decided to always be precise and write only the types/globals that are needed. This means that you effectivly get type and global GC regardless of the --gc-sections flag. The code above marks the types are live already, this just extends it to handle globals

ruiu added inline comments.Feb 21 2018, 9:40 AM
test/wasm/many-functions.ll
14 ↗(On Diff #135159)

I know this test tests "many functions", but it seems a bit too many. Do you really need this many functions?

wasm/Driver.cpp
309 ↗(On Diff #135163)

Why don't you use make<>?

wasm/InputChunks.h
13 ↗(On Diff #135163)

after relocations are applied

wasm/InputFiles.cpp
214–215 ↗(On Diff #135163)

You can eliminate Global local variable.

wasm/InputFiles.h
124 ↗(On Diff #135163)

InputGlobal *Global

wasm/InputGlobal.h
1 ↗(On Diff #135163)

Wrong header name

19–20 ↗(On Diff #135163)

I think using in a header isn't a good idea.

25 ↗(On Diff #135163)

Wasm Global Variable is perhaps better? Global functions are global, so the use of Global is wasm-speak.

42 ↗(On Diff #135163)

How large is WasmGlobal? I wonder which is better, keeping it through a reference or copying it here.

45 ↗(On Diff #135163)

It is better to keep this kind of things trivially destructible.

wasm/Symbols.h
212 ↗(On Diff #135163)

Ditto

234–235 ↗(On Diff #135163)

I prefer explicit code even if it's a bit verbose, so let's avoid default parameters in lld. I believe you want to pass (null, null) only when you are adding synthetic sections, so in that case, you can explicitly pass (null, null).

wasm/Writer.cpp
131 ↗(On Diff #135163)

DefinedGlobal is a symbol type, so this variable name is a it confusing.

699–708 ↗(On Diff #135163)

Let's be consistent -- omit {}.

sbc100 updated this revision to Diff 135276.Feb 21 2018, 9:43 AM
  • add Symbol::isLive()
sbc100 updated this revision to Diff 135286.Feb 21 2018, 10:17 AM
sbc100 marked 7 inline comments as done.
  • feedback
  • clang-format
  • feedback
test/wasm/many-functions.ll
14 ↗(On Diff #135159)

This is designed to test the case there there are more then 127 functions, which means the number of functions no longer can serialized as a single byte LEB. Since wasm uses LEBs extensively there can be a serialization bugs that only show up as we go from 127 to 128 elements. if you can think of a better way to tickle this case I'm open to changing this test.

(Also, I just noticed the comment above above describes this)

wasm/InputChunks.h
13 ↗(On Diff #135163)

But relocations are applies after the memcpy, no?

wasm/InputGlobal.h
19–20 ↗(On Diff #135163)

We have other uses like this of using specific types. If its OK with you I'll cleanup in a separate change and we can discuss there.

42 ↗(On Diff #135163)

Around 16 bytes or so. And there is currently only one of them :) There may be more soon, but currently we are only talking about stack_pointer. If we find users are generating a lot of these we could reconsider this. In that case we could need to handle the stack_pointer differently since we currently really on the mutability of this field to set the SP value.

For now I think this is fine.

45 ↗(On Diff #135163)

Why? I think its better to match the pattern used in InputChunks.h (initially anyway).

(Also, Optional is only non-trivial on gcc due to a compiler bug)

ruiu added inline comments.Feb 21 2018, 10:38 AM
wasm/InputChunks.h
13 ↗(On Diff #135163)

I meant there was a typo in this line.

wasm/InputGlobal.h
42 ↗(On Diff #135163)

Agreed.

45 ↗(On Diff #135163)

Because I thought you would allocate a lot of instances of this class for a large program, just like we do to Symbol.

sbc100 updated this revision to Diff 135297.Feb 21 2018, 10:46 AM
  • fix typos in comment
ncw added inline comments.Feb 22 2018, 1:28 AM
wasm/Symbols.h
72–73 ↗(On Diff #135297)

OK, I'm happy with doing GC for globals in a separate place, if that's what you prefer... the isLive method is good. The comment here isn't accurate though (pedantic complaint this!) - "For function/data symbols, only valid after calling markLive; for global symbols only valid after global GC has been done".

wasm/Writer.cpp
816 ↗(On Diff #135297)

(I guess this should be AddInputGlobal now)

898 ↗(On Diff #135297)

It would be neater to give SyntheticFunction a std::string member, and call its constructor with std::move(CtorFunctionBody) so that the SyntheticFunction owns the synthetic body. Can do post-merge.

sbc100 updated this revision to Diff 135516.Feb 22 2018, 2:10 PM
sbc100 marked an inline comment as done.
  • feedback
wasm/Writer.cpp
898 ↗(On Diff #135297)

Sure. Unrelated to this change though.

wabt change is ready to go: https://github.com/WebAssembly/wabt/pull/769

Any more feedback on this?

sbc100 updated this revision to Diff 135530.Feb 22 2018, 3:37 PM
  • typos
  • feedback
  • Simplify address calculations for data symbols
ruiu accepted this revision.Feb 22 2018, 4:24 PM

LGTM

There are a few things that hasn't been addressed, but we can address them later if we still want to do that.

wasm/SymbolTable.cpp
252–264 ↗(On Diff #135530)

Overall, in the symbol table, we should have three functions for Function, GLobal and Data, instead of combining them into one function. I believe that'd be easier to read.

wasm/Symbols.h
174–176 ↗(On Diff #135530)

That's a bit alarming that this function has so many parameters with default arguments. I generally do not prefer default parameters much because, for every optional parameter, it essentially defines a new overloaded function behind the scene. I'll try to address that later.

wasm/Writer.cpp
400 ↗(On Diff #135530)

We should split this function because it grows organically to the point that it's too large.

ruiu added a comment.Feb 22 2018, 4:28 PM

(So my previous comments were optional -- you can submit without fixing them.)

sbc100 updated this revision to Diff 135591.Feb 22 2018, 9:08 PM
  • remove default args
sbc100 marked an inline comment as done.Feb 22 2018, 9:09 PM

I fixed the DefinedData constructor args. The rest we can follow up with. Thanks for all the reviewing!

Closed by commit rL325861: [WebAssembly] Add explicit symbol table (authored by sbc, committed by ). · Explain WhyFeb 22 2018, 9:12 PM
This revision was automatically updated to reflect the committed changes.