This is an archive of the discontinued LLVM Phabricator instance.

[WebAssembly] Add explicit symbol table
ClosedPublic

Authored by sbc100 on Feb 13 2018, 3:51 PM.

Details

Summary

This change modified lld to in response the llvm change which
moved to a more explicit symbol table in the object format.

Based on patches by Nicholas Wilson:

  1. https://reviews.llvm.org/D41955
  2. https://reviews.llvm.org/D42585

The primary difference that we see in the test output is that
for relocatable (-r) output we now have symbol table which
replaces exports/imports and globals.

See: https://github.com/WebAssembly/tool-conventions/issues/38

Diff Detail

Repository
rL LLVM

Event Timeline

sbc100 created this revision.Feb 13 2018, 3:51 PM
sbc100 updated this revision to Diff 134139.Feb 13 2018, 3:58 PM
  • cleanup
sbc100 updated this revision to Diff 134140.Feb 13 2018, 4:03 PM
  • commit
ruiu added a comment.Feb 13 2018, 4:07 PM

Could you explain what you meant by "a more explicit symbol table"? This is fairly large change, so it isn't very easy to get the gist of it.

sbc100 edited the summary of this revision. (Show Details)Feb 13 2018, 4:09 PM

This is indeed a large change. I added a little more to the description.

@ncw has been working on these changes as two seperate partss (actual 4 parts, 2 in llvm and 2 on lld). I decided to merge them into single commit in each repo.

Essentially, for the relocatable output (llc output and lld -r output) we using an explicit symbol table rather than trying to module the entire thing using wasm's import/export/global concepts.

We get several advantages when doing this:

  1. relocations can refer a symbol index (previously we have no symbol index space)
  2. Data symbols can have other information such as size (previously we have just a single 32-bit address)
  3. Data symbols can specify which segment they live in. We had we previously having trouble with data symbols that we on the boundy of two segments. Now we store "segment index + offset".

Some downside:

  1. Relocatable output is less readable by standard wasm tools (and browsers).
  2. Relocatable and executable output have two different way to export things.

The original bug: https://github.com/WebAssembly/tool-conventions/issues/38

sbc100 edited the summary of this revision. (Show Details)Feb 13 2018, 4:16 PM

We are hoping this will be the last major change before we try push WebAssembly out of experimental on the llvm side. After that things should settle down (somewhat).

sbc100 updated this revision to Diff 134148.Feb 13 2018, 4:21 PM
  • format
sbc100 updated this revision to Diff 134150.Feb 13 2018, 4:33 PM
  • cleanup
ruiu added inline comments.Feb 13 2018, 5:22 PM
wasm/InputChunks.h
183 ↗(On Diff #134140)

Chunk is a concept to represent some contiguous bytes in an input file or a synthesized one. Essentially, a chunk is a section or something similar.

It doesn't seems to me that this class falls in that category, though I may be wrong due to lack of knowledge on wasm. Is this an appropriate abstraction?

wasm/InputFiles.cpp
248 ↗(On Diff #134140)

I'd just call Symbols.push_back(createUndefined(WasmSym)) and remove S.

Or, I'd move this switch-cases to a new function returning Symbol * and call like this:

Symbols.push_back(createSymbol(WasmSym));
wasm/InputFiles.h
111 ↗(On Diff #134140)

Symbol -> Index

132–133 ↗(On Diff #134140)

It seems more straightforward if you define getDataAddress and getDataSize.

wasm/MarkLive.cpp
73–75 ↗(On Diff #134140)

Since we skip only one line, I'd flip the condition so that we can wreite

for (...)
  if (Reloc.Type != ...)
    Enqueue(...);
wasm/SymbolTable.cpp
134–137 ↗(On Diff #134140)

If you wrap this up as a lambda

auto Error = [&](const Twine &Old, const Twine &New) {
  error(toString(NewType) + " type mismatch: " + Existing.getName() +
        "\n>>> defined as " + Old + " in " +
        toString(Existing.getFile()) + "\n>>> defined as " + New +
        " in " + F.getName());
};

then you can eliminate OldSigStr and NewSigStr variables from this function.

wasm/Symbols.cpp
140 ↗(On Diff #134140)

Extraneous ()

201 ↗(On Diff #134140)

I believe llvm_unreachable is defined as _Noreturn, so you don't need this return "" after llvm_noreturn.

217 ↗(On Diff #134140)

Ditto

wasm/Symbols.h
152–154 ↗(On Diff #134140)

A function is created for a chunk, so it is a bit odd that there's a situation that you have to change a chunk for an existing function symbol. What is this for?

238–244 ↗(On Diff #134140)

Why does this have two constructors?

wasm/Writer.cpp
194–198 ↗(On Diff #134140)

This is where you want to use cast instead of dyn_cast.

} else {
   auto *GlobalSym = cast<GlobalSymbol>(Sym);
   ....
}
412–416 ↗(On Diff #134140)

This nested ?: operators look a bit fancy to me. I'd perhaps make this a tiny helper function that takes a Sym and returns a Kind.

wasm/WriterUtils.cpp
221–225 ↗(On Diff #134140)

This is perhaps a personal taste, but isn't this better?

std::string S = toString(static_cast<ValType>(Sig.Type));
if (Sig.Mutable)
  return "mutable " + S;
return S;
sbc100 updated this revision to Diff 134153.Feb 13 2018, 6:38 PM
sbc100 marked 9 inline comments as done.
  • feedback
sbc100 added inline comments.Feb 13 2018, 6:38 PM
wasm/InputChunks.h
183 ↗(On Diff #134140)

Yes, are you very astute. This was my feedback on @ncw 's original change too. However after playing around with it locally a bunch I decided it might be OK (for now anyway).

I did put a TODO below about this.

One idea which would work OK is to remove this from the "InputChunk" class hierarchy. However, if I do this then this new class is out on its own, maybe in a new file. Not a big deal I guess.

If we come up with higher level concept such as a more general "ElementFromIntoFile" :) perhaps we could share a bare class? I'm happy to experiment more with that once this lands. Or if you feel strongly, before it lands.

wasm/InputFiles.cpp
248 ↗(On Diff #134140)

Makes sense. I was trying to minimize the delta. Perhaps I split that off into a pre-(or post) change.

wasm/InputFiles.h
132–133 ↗(On Diff #134140)

Remove this function completely.

wasm/Symbols.h
152–154 ↗(On Diff #134140)

This exists purely for the synthetic function which we create in Writer::createCtorFunction(). In this case we want to create the symbol at startup, and then create its body later on. I also agree this could be factored more elegantly. I'll add a TODO comment here.

238–244 ↗(On Diff #134140)

One is for file-defined globals and one is for synthetic defined globals. Comment added.

ncw accepted this revision.Feb 14 2018, 5:14 AM

All looks good, thanks for integrating with the Symbols.h changes!

wasm/InputChunks.h
183 ↗(On Diff #134140)

My feedback when Sam asked about this was: Well it's not exactly a chunk of contiguous bytes with relocations, but it is nonetheless a piece of "data" from the input file that has to be copied to the output file.

The data is a struct WasmInitExpr; it's a fixed-size chunk of data that doesn't support any relocations, but is nonetheless a kind of input section, where the contents satisfy certain constraints. We could in fact have the data() method return a reference to its binary encoding as the "Body" of the chunk that will be written out in the GLOBALS part of the Wasm file, and then get rid of the getInitExpr() method. Would that be a suitable post-commit tidy?

wasm/InputFiles.cpp
183 ↗(On Diff #134153)

Could just use cast<> to get the assertion for free?

wasm/Symbols.h
135 ↗(On Diff #134153)

Could add a comment explaining that the symbol type is available via the InputChunk for DefinedFunction. Even though I implemented that, and forgot about it just now when I was reviewing, and spent a minute trying to remember how checkSymbolTypes was able to work if the FunctionType wasn't set for defined functions!

This revision is now accepted and ready to land.Feb 14 2018, 5:14 AM
sbc100 updated this revision to Diff 134275.Feb 14 2018, 11:27 AM
sbc100 marked an inline comment as done.
  • rebase + feedback
ruiu added a comment.Feb 14 2018, 11:54 AM

Honestly I don't think I understand what InputGlobal class is doing and what's that for. The comment says that InputGlobal represents a single Wasm Global, and for that purpose, we do have symbol class right? That I can't understand that part of code doesn't necessarily mean that that's wrong, but it is a signal that something isn't quite right.

sbc100 updated this revision to Diff 134290.Feb 14 2018, 12:03 PM
  • clang-format

Maybe I can explain a little better.

This change introduces this new concept of a first class wasm "global" which can be named by "GlobalSymbol". This is distinct from data objects which are names via "DataSymbol". For now, llvm is only using one actual such wasm global (the one it used for stack pointer manipulation via the __stack_pointer symbol).

For parity with Functions and Data objects we want to allow a many-to-one mapping between symbols and the things symbols point to. This is mostly for supporting aliases. This means you can have several names (Symbols) pointing to a given InputFunction, InputGlobal, or InputSegment (segments can have even more names pointing to them because symbols can point to offsets within a segment).

However, as you point out, it is strange that InputGlobal is not really an InputChunk, in that it is not copied into place with memcpy + relocations like the other two subtypes. We can fix that in a followup CL, either by having it use memcpy for its tiny region, or my breaking the inheritance hierarchy.

ruiu added a comment.Feb 14 2018, 12:42 PM

It sounds like InputGlobal is a symbol. Is this correct? (I know that you guys want to fix this later in a follow-up patch, but I cannot say LGTM to the code that I don't understand how it works...)

ruiu added inline comments.Feb 14 2018, 12:52 PM
wasm/InputChunks.h
193–194 ↗(On Diff #134290)

Looks like this class is defined to wrap WasmGlobal objects so that they have these methods -- {set,get,has}OutputIndex. Is that correct? This class design looks really odd to me.

sbc100 added a comment.EditedFeb 14 2018, 12:52 PM

It sounds like InputGlobal is a symbol. Is this correct? (I know that you guys want to fix this later in a follow-up patch, but I cannot say LGTM to the code that I don't understand how it works...)

It represents a wasm global that could have an initial value. Several symbols could point to a given global. For example __stack_pointer and __stack_pointer_alias could both point to global 0, and in this case we would only want to write a single global to the final output and have both those symbols resolve to the same value when used in relocations.

sbc100 added inline comments.Feb 14 2018, 12:57 PM
wasm/InputChunks.h
193–194 ↗(On Diff #134290)

Yes, that is pretty much exactly what it does. That is the behavior we need.

Should we move it out of this hierarchy, and called it "OutputGlobal" perhaps?

ruiu added a comment.Feb 14 2018, 1:04 PM

What is the relationship between Wasm Global and Wasm Symbol? In most object file formats, global/local is an attribute of a symbol, so when you say "global" it implies that it is a symbol. So I was thinking of symbols when you wrote "wasm global". But it sounds like that's different in wasm.

Sorry, I can see why its confusing. The concept of a wasm global is different to the concept of a symbol (which can have local or global binding). We now have three symbols types which can name data, functions or wasm globals. I the ELF world the first two would be called STT_OBJECT and STT_FUNC. This new type has no ELF equivalent since its a wasm specific concept. So far the only use for this symbol type is the stack pointer but we see future uses (for example for tls access and possibly via intrinsics or assembly).

ruiu added a comment.Feb 14 2018, 1:22 PM

So, InputFunction is a function, InputSegment is a data segment, and what is "wasm global" for? It sounds like it is more like data than a function, but I wonder how "wasm global" is different from regular data section.

So, InputFunction is a function, InputSegment is a data segment, and what is "wasm global" for? It sounds like it is more like data than a function, but I wonder how "wasm global" is different from regular data section.

Wasm has these first class things called globals that live outside the address space:
https://github.com/WebAssembly/design/blob/master/Modules.md#global-section

The can only be accessed via special get_global and set_gobal instructions.

ncw added a comment.Feb 15 2018, 1:04 AM

So, InputFunction is a function, InputSegment is a data segment, and what is "wasm global" for? It sounds like it is more like data than a function, but I wonder how "wasm global" is different from regular data section.

It's basically a special type of data section; what makes it special is that it's accessed more like a register than a memory region.

The InputGlobal object represents the allocation of the memory/storage for the symbol, and contains the region of memory to be copied into the output binary with the initial value of the data.

You're right, it's very similar to a data symbol. The InputGlobal has really the same role for a GlobalSymbol as an InputSection does for a DataSymbol.

// Will be stored in ordinary memory, addressable down to individual bytes.
// Creates a 12-byte data section (InputSegment), holding its initial value, which
// the data Symbol points to as its definition.
std::tuple<int,int,int> global_variable = {1,2,3};

// Will create a 4-byte section, but using a "Wasm global". Not addressable down
// to individual bytes, it's like a register. Can't take address-of, can only assign
// and read from the storage. Requests the WebAssembly virtual machine to
// allocate a thread-local register. Constrained to be of register type (int32/float32/
// int64/float64) so the size of an InputGlobal is always a 4- or 8-byte section.
//
// Unlike x86 registers, you can have have as many as you want, they are storage
// slots allocated at program startup, just like ordinary sections are mapped to
// RAM. (NB The WebAssembly interpreter will spill the globals to RAM if the "real"
// CPU doesn't have enough spare registers to use for the Wasm globals). So it's like
// data, in that these are variables with a value, but the mechanics of the storage
// slots are different. There's no analogue in ELF, because x86 has a fixed number of
// registers, and the binary doesn't have to ask for them to be created.
//
// The 4-byte section containing the initial value is represented by an InputGlobal,
// which the Wasm-global symbol points to as its definition.  It's written out in a
// different place in the final Wasm object, since it has distinct semantics.
//
// NB. This attribute isn't actually implemented in the C frontend ("Wasm globals"
// are only accessible via intrinsics). Conceptually these are just variables though.
int __attribute__((wasm_global)) wasm_global = 123;
sbc100 updated this revision to Diff 134479.Feb 15 2018, 11:38 AM
  • shorten yaml names
ruiu added a comment.Feb 15 2018, 4:55 PM

I read the documentation of wasm to understand its model better. Here is my observation (some of them are very obvious to you guys though)

  • Wasm has linear memory (which is basically a contiguous byte-addressable memory), functions (that don't exist in the linear memory space but you can call them by their indices) and global variables (that live in their own addressing space)
  • Globals variables basically consists of types and indices

It looks more and more like global variables are just symbols with indices. Do you agree? I believe it can naturally be modeled as a symbol.

That makes me wonder what the current DefinedGlobal represents. Is this a symbol that lives in the linear memory or global variable?

sbc100 added a comment.EditedFeb 15 2018, 5:15 PM

In the old code we used DefinedGlobal to meant global data in linear memory. In the new code that got renamed to DefinedData and a new type of entity called DefinedGlobal was added for wasm globals. Its a shame it has the same name in the new code, but I think it makes sense.

To make this even more complicated to follow, we previously modeled data symbols as wasm globals... i.e. we use the wasm globals in the object format to store the addresses in linear memory of data symbols. A little confusing, and one reason why we are making this change to have a more explicit symbol table.

We could simply store the wasm global data in the DefinedGlobal itself, rather then having it point to an InputGlobal. I think the problem there is when more than one symbol aliases the same global. But, we don't have the case yet so perhaps we can push that down the line a little. Let me see if I can simplify things.

ruiu added a comment.Feb 15 2018, 5:19 PM

We could simply store the wasm global data in the DefinedGlobal itself, rather then having it point to an InputGlobal. I think the problem there is when more than one symbol aliases the same global. But, we don't have the case yet so perhaps we can push that down the line a little. Let me see if I can simplify things.

What is your concern on aliasing? In general, symbols can have aliases (by just pointing to the same offset in the same section). We don't do anything special about it in other file formats, and it just works fine -- we don't need to find if a symbol has aliases or not to generate a correct output. I'm wondering if there's some wasm-specific concern about it.

ncw added a comment.Feb 15 2018, 5:38 PM

To make this even more complicated to follow, we previously modeled data symbols as wasm globals... i.e. we use the wasm globals in the object format to store the addresses in linear memory of data symbols. A little confusing, and one reason why we are making this change to have a more explicit

We could simply store the wasm global data in the DefinedGlobal itself, rather then having it point to an InputGlobal. I think the problem there is when more than one symbol aliases the same global. But, we don't have the case yet so perhaps we can push that down the line a little. Let me see if I can simplify things.

I'm not sure it would be simpler to put the definition (initialiser) for the symbol in the symbol itself. As you say, it's conceptually wrong because it doesn't work with aliases (unused in this case, but it demonstates the model wouldn't be right); secondly it breaks the link between defined symbols and their "chunk" containing the initializer; and thirdly, worst, prevents unifying the handling of globals/data (eg with comdat for globals, then symbols could then be initialized in the same way for all types).

I think we should aim to have globals work in the same way in LLD as data/function symbols, to keep the mental model simple, rather than arbitrarily constraing them with different resolution semantics (eg coding up a way of defining global symbols that doesn't allow aliases).

  • Globals variables basically consists of types and indices

And an initializer! That's what distinguishes a defined global from an imported/undefined one, in the same way that the initializer (bss or data section) does for a data symbol, or the function body does for a function symbol.

It looks more and more like global variables are just symbols with indices. Do you agree? I believe it can naturally be modeled as a symbol.

That makes me wonder what the current DefinedGlobal represents. Is this a symbol that lives in the linear memory or global variable?

DefinedGlobal is a symbol representing storage that's only accessible via its global index, rather than via a linear memory address. "Defined" means the Wasm file allocates that storage, rather than importing the storage from another DSO. We definitely need the distinction between defined and undefined for globals.

The key takeaway is that a global is a storage/memory slot that can be allocated by a binary, just like a data symbol in ELF, it's only the addressing model that's different (via index rather than RAM address).

Thanks for looking into this!

ruiu added a comment.Feb 15 2018, 5:46 PM

And an initializer! That's what distinguishes a defined global from an imported/undefined one, in the same way that the initializer (bss or data section) does for a data symbol, or the function body does for a function symbol.

An initializer can just be a member of a symbol, no?

Global variables look very similar to imported functions. There's no "body" of imported symbols, and therefore there's no chunk that represents imported function body, even though imported symbols are "defined" (as opposed to "undefined" symbols that have to be resolved until the end of linking). Symbols doesn't have to have chunks, and that's not actually odd.

ncw added a comment.Feb 15 2018, 5:46 PM

Aliasing works because of InputSection/InputChunk, which allows multiple symbols to reference the same chunk of memory, without having the section or body duplicated in the output. If you get rid of InputGlobal, you'll have aliasing issues, because the output will duplicate the global, causing it to be allocated storage twice - just like if a data section aliased by two symbols would chew up double the RAM if the symbols weren't able to share the InputSection. Plus the semantics would be wrong because writes to one alias should be visible by reads to the other.

ruiu added a comment.Feb 15 2018, 5:52 PM

Aliasing works because of InputSection/InputChunk, which allows multiple symbols to reference the same chunk of memory, without having the section or body duplicated in the output. If you get rid of InputGlobal, you'll have aliasing issues, because the output will duplicate the global, causing it to be allocated storage twice - just like if a data section aliased by two symbols would chew up double the RAM if the symbols weren't able to share the InputSection. Plus the semantics would be wrong because writes to one alias should be visible by reads to the other.

I don't think you need to use InputChunk to address the issue. You can just maintain the indices of globals the linker has already emitted so that it doesn't assign different indices to two Globals that originally had the same index, for example. There may be easier way of doing it than that.

ncw added a comment.Feb 15 2018, 5:56 PM

An initializer can just be a member of a symbol, no?

By that logic, you wouldn't need InputSection at if all in the ELF linker. You just put the body of each symbol on the symbols, rather than allowing multiple symbols to defence the same section.

Global variables look very similar to imported functions. There's no "body" of imported symbols, and therefore there's no chunk that represents imported function body, even though imported symbols are "defined" (as opposed to "undefined" symbols that have to be resolved until the end of linking). Symbols doesn't have to have chunks, and that's not actually odd.

Globals are really more like data symbols. An int variable would have a 4-byte section asking the ELF interpreter to allocate 4 bytes of ram. A wasm global I32 has its own section in the wasm file, asking for a 4byte register-backed region to be allocated.

Do we have a different understanding of "imported"? An imported function is definitely undefined from a wasm and a linking point of view - both in ELF and warm, the function symbols with no body are undefined and the linker is looking for a definition against which to reduce any calls.

ncw added a comment.Feb 15 2018, 6:05 PM

I don't think you need to use InputChunk to address the issue. You can just maintain the indices of globals the linker has already emitted so that it doesn't assign different indices to two Globals that originally had the same index, for example. There may be easier way of doing it than that.

That's true, I admit. But you haven't done that for functions (even though it would be equally possible to get rid of InputSection or InputFunction in the same way). I just thinks it really helps comprehension if all the symbol clases are implemented in the same way, rather than differing in ad-hoc ways. Wanting DefinedFunction to work like DefinedGlobal is similar to wanting the wasm Symbol class to match the ELF symbol class.

I don't want to argue though - I'll go with your preference as the code owners, now I've (hopefully) explained my reasoning.

ruiu added a comment.Feb 15 2018, 6:05 PM

By that logic, you wouldn't need InputSection at if all in the ELF linker. You just put the body of each symbol on the symbols, rather than allowing multiple symbols to defence the same section.

I don't know if I understand what you wrote correctly. In ELF, you need both symbols and sections in the linker. One section can have multiple symbols at its various offsets, and some of them are aliases (e.g. two symbols can point to the same location in the same section).

Globals are really more like data symbols. An int variable would have a 4-byte section asking the ELF interpreter to allocate 4 bytes of ram. A wasm global I32 has its own section in the wasm file, asking for a 4byte register-backed region to be allocated.

Well, that runtime allocates some memory at runtime doesn't necessarily mean that that has to be represented as Chunk. At least to me the use of Chunk in this patch to represent Globals is very odd and is not an intended use of Chunk. Chunk is designed to represent sections or something like that, and Globals is different.

Do we have a different understanding of "imported"? An imported function is definitely undefined from a wasm and a linking point of view - both in ELF and warm, the function symbols with no body are undefined and the linker is looking for a definition against which to reduce any calls.

By imported, I mean symbols in DSO (DLL) files. In ELF, it's SharedSymbol. In COFF, It's DLLImported symbol. They are considered "defined" because they are already resolved, but the actual code is not given until load-time.

ncw added a comment.Feb 15 2018, 6:14 PM

I see about "imported", thanks! I hadn't realized that terminology in DSOs, my mistake. In wasm "imported" really is undefined.

Maybe we could compromise on keeping InputGlobal, but renaming it and making it not derive from InputChunk. That would almost compile as-is in fact, and I think maybe you've right that would be cleaner. MarkLive would have to gain support for it, but that's only a few lines of code actually.

I could try that tomorrow, or Sam could try now?

In D43264#1009610, @ncw wrote:

I see about "imported", thanks! I hadn't realized that terminology in DSOs, my mistake. In wasm "imported" really is undefined.

Maybe we could compromise on keeping InputGlobal, but renaming it and making it not derive from InputChunk. That would almost compile as-is in fact, and I think maybe you've right that would be cleaner. MarkLive would have to gain support for it, but that's only a few lines of code actually.

I could try that tomorrow, or Sam could try now?

I've been trying to do this this evening but its fairly entangled. I will try again tomorrow, perhaps with a preparatory change to precede it.

ncw added a comment.Feb 16 2018, 1:53 PM

I've been trying to do this this evening but its fairly entangled. I will try again tomorrow, perhaps with a preparatory change to precede it.

Sorry Sam, somehow this didn't catch my eye in my inbox -I realise now we might have been working on the same cleanup :( Oh well, don't worry if you go a different direction to what I did in D43391.

I can't really chat this evening - not that I'd be much help anyway!

If you and Rui can come up with a solution that works for InputGlobal - go ahead and merge! I'll take any solution, I'm only arguing for changes at the moment that are fairly close to how these patches are currently, to so that we don't end up with another big refactor on our hands before we can merge what we've got so far.

Do you have any thoughts at this point on how you think InputGlobal should fit into the hierarchy, eg like the comment in D43391?

sbc100 added a comment.EditedFeb 16 2018, 2:42 PM

It also didn't feel right to have InputGlobal to inherit from InputChunk. For me the Chunks are really for the non-synthetic sections. That is the ones construct using memory copied from the input files + relocations.

I think of the global section as being a sythentic section (indeed it is right now). So I guess I agree with ruiu here. But don't worry, I'm sure we can find a solution...we will get this thing landed soon one way or another :)

sbc100 updated this revision to Diff 135159.Feb 20 2018, 4:04 PM
  • Don't use InputChunk to model globals
sbc100 updated this revision to Diff 135163.Feb 20 2018, 4:16 PM
  • revert parts
ncw accepted this revision.Feb 21 2018, 4:01 AM

It looks OK still. It's familiar, in that I tried a similar refactor (making InputGlobal not inherit InputChunk), and I can see you've faced the same problems.

I think it works, just some comments about consistency below.

wasm/MarkLive.cpp
45–47 ↗(On Diff #135163)

You're not marking globals as Live, so how will they avoid being GC'd? You need something like this to be added I think:

if (DefinedGlobal *G = dyn_cast<DefinedGlobal>(Sym)) {
  G->Global->Live = true;
  return;
}
InputChunk *Chunk = ...

It's probably just slipped under the radar because the stack-pointer is a synthetic global, and there aren't any tests yet for non-synthetic ones.

I have some tests I'm waiting to commit for more global support, I guess we could fix GC for defined globals then.

wasm/SymbolTable.cpp
147–148 ↗(On Diff #135163)

It's odd that this is the one case without an assert(WasInserted) - I've forgotten, is there a reason we can't assert here like we do in addSyntheticFunction/Global?

wasm/Symbols.cpp
49–51 ↗(On Diff #135163)

Missing specialisation for DefinedGlobal here; hasOutputIndex should definitely match the implementation for getOutputIndex below!

166–173 ↗(On Diff #135163)

Shouldn't this method go away too now? We got rid of it on master for FunctionSymbol, so GlobalSymbol should match that unless there's a good reason to use the old scheme still.

wasm/Writer.cpp
144 ↗(On Diff #135163)

Should this unique_ptr<InputGlobal> still be here? There are now two of these, one in the Linker and one here. (Personally, I think it makes sense to own the stack-pointer here and get rid of the InputGlobal in Linker, so that CtorFunction and StackPointer behave in the same way; I can't immediately think of a reason why they shouldn't...)

721–723 ↗(On Diff #135163)

Also needs a special branch like in MarkLive:

if (DefinedGlobal* G = dyn_cast<DefinedGlobal>(Sym))
  if (G->Global && !G->Global->Live)
    continue;
}
InputChunk *Chunk = Sym->getChunk();
if (Chunk && !Chunk->Live)
  continue;

I know it's tedious... but if you're going to have the Live flag on InputGlobal (which I agree we do want) then we have to check it separately everywhere that we currently check the InputChunk::Live flag. That's the nuisancesome consequence of splitting InputGlobal off from InputChunk, but we can stomach that cost for now.

804–809 ↗(On Diff #135163)

?? Shouldn't this be done in MarkLive; why should globals be different to functions in this regard?

sbc100 updated this revision to Diff 135275.Feb 21 2018, 9:26 AM
sbc100 marked 10 inline comments as done.
  • feedback
wasm/MarkLive.cpp
45–47 ↗(On Diff #135163)

Globals (like Types) are always generated precisely regardless of --gc-sections, and so are not handled here.

wasm/SymbolTable.cpp
147–148 ↗(On Diff #135163)

Oops, that must have been merge conflict. Restored.

wasm/Writer.cpp
721–723 ↗(On Diff #135163)

I think we will want an isLive() method on Symbol, like the COFF linker has. I will add that before landing this change.

804–809 ↗(On Diff #135163)

When I added GC I decided to make it only apply to functions and data (chunks). For synthetic sections like the Type section, and now the Global section, I decided to always be precise and write only the types/globals that are needed. This means that you effectivly get type and global GC regardless of the --gc-sections flag. The code above marks the types are live already, this just extends it to handle globals

ruiu added inline comments.Feb 21 2018, 9:40 AM
test/wasm/many-functions.ll
14 ↗(On Diff #135159)

I know this test tests "many functions", but it seems a bit too many. Do you really need this many functions?

wasm/Driver.cpp
309 ↗(On Diff #135163)

Why don't you use make<>?

wasm/InputChunks.h
13 ↗(On Diff #135163)

after relocations are applied

wasm/InputFiles.cpp
214–215 ↗(On Diff #135163)

You can eliminate Global local variable.

wasm/InputFiles.h
124 ↗(On Diff #135163)

InputGlobal *Global

wasm/InputGlobal.h
1 ↗(On Diff #135163)

Wrong header name

19–20 ↗(On Diff #135163)

I think using in a header isn't a good idea.

25 ↗(On Diff #135163)

Wasm Global Variable is perhaps better? Global functions are global, so the use of Global is wasm-speak.

42 ↗(On Diff #135163)

How large is WasmGlobal? I wonder which is better, keeping it through a reference or copying it here.

45 ↗(On Diff #135163)

It is better to keep this kind of things trivially destructible.

wasm/Symbols.h
212 ↗(On Diff #135163)

Ditto

234–235 ↗(On Diff #135163)

I prefer explicit code even if it's a bit verbose, so let's avoid default parameters in lld. I believe you want to pass (null, null) only when you are adding synthetic sections, so in that case, you can explicitly pass (null, null).

wasm/Writer.cpp
131 ↗(On Diff #135163)

DefinedGlobal is a symbol type, so this variable name is a it confusing.

699–708 ↗(On Diff #135163)

Let's be consistent -- omit {}.

sbc100 updated this revision to Diff 135276.Feb 21 2018, 9:43 AM
  • add Symbol::isLive()
sbc100 updated this revision to Diff 135286.Feb 21 2018, 10:17 AM
sbc100 marked 7 inline comments as done.
  • feedback
  • clang-format
  • feedback
test/wasm/many-functions.ll
14 ↗(On Diff #135159)

This is designed to test the case there there are more then 127 functions, which means the number of functions no longer can serialized as a single byte LEB. Since wasm uses LEBs extensively there can be a serialization bugs that only show up as we go from 127 to 128 elements. if you can think of a better way to tickle this case I'm open to changing this test.

(Also, I just noticed the comment above above describes this)

wasm/InputChunks.h
13 ↗(On Diff #135163)

But relocations are applies after the memcpy, no?

wasm/InputGlobal.h
19–20 ↗(On Diff #135163)

We have other uses like this of using specific types. If its OK with you I'll cleanup in a separate change and we can discuss there.

42 ↗(On Diff #135163)

Around 16 bytes or so. And there is currently only one of them :) There may be more soon, but currently we are only talking about stack_pointer. If we find users are generating a lot of these we could reconsider this. In that case we could need to handle the stack_pointer differently since we currently really on the mutability of this field to set the SP value.

For now I think this is fine.

45 ↗(On Diff #135163)

Why? I think its better to match the pattern used in InputChunks.h (initially anyway).

(Also, Optional is only non-trivial on gcc due to a compiler bug)

ruiu added inline comments.Feb 21 2018, 10:38 AM
wasm/InputChunks.h
13 ↗(On Diff #135163)

I meant there was a typo in this line.

wasm/InputGlobal.h
42 ↗(On Diff #135163)

Agreed.

45 ↗(On Diff #135163)

Because I thought you would allocate a lot of instances of this class for a large program, just like we do to Symbol.

sbc100 updated this revision to Diff 135297.Feb 21 2018, 10:46 AM
  • fix typos in comment
ncw added inline comments.Feb 22 2018, 1:28 AM
wasm/Symbols.h
72–73 ↗(On Diff #135297)

OK, I'm happy with doing GC for globals in a separate place, if that's what you prefer... the isLive method is good. The comment here isn't accurate though (pedantic complaint this!) - "For function/data symbols, only valid after calling markLive; for global symbols only valid after global GC has been done".

wasm/Writer.cpp
816 ↗(On Diff #135297)

(I guess this should be AddInputGlobal now)

898 ↗(On Diff #135297)

It would be neater to give SyntheticFunction a std::string member, and call its constructor with std::move(CtorFunctionBody) so that the SyntheticFunction owns the synthetic body. Can do post-merge.

sbc100 updated this revision to Diff 135516.Feb 22 2018, 2:10 PM
sbc100 marked an inline comment as done.
  • feedback
wasm/Writer.cpp
898 ↗(On Diff #135297)

Sure. Unrelated to this change though.

wabt change is ready to go: https://github.com/WebAssembly/wabt/pull/769

Any more feedback on this?

sbc100 updated this revision to Diff 135530.Feb 22 2018, 3:37 PM
  • typos
  • feedback
  • Simplify address calculations for data symbols
ruiu accepted this revision.Feb 22 2018, 4:24 PM

LGTM

There are a few things that hasn't been addressed, but we can address them later if we still want to do that.

wasm/SymbolTable.cpp
252–264 ↗(On Diff #135530)

Overall, in the symbol table, we should have three functions for Function, GLobal and Data, instead of combining them into one function. I believe that'd be easier to read.

wasm/Symbols.h
174–176 ↗(On Diff #135530)

That's a bit alarming that this function has so many parameters with default arguments. I generally do not prefer default parameters much because, for every optional parameter, it essentially defines a new overloaded function behind the scene. I'll try to address that later.

wasm/Writer.cpp
400 ↗(On Diff #135530)

We should split this function because it grows organically to the point that it's too large.

ruiu added a comment.Feb 22 2018, 4:28 PM

(So my previous comments were optional -- you can submit without fixing them.)

sbc100 updated this revision to Diff 135591.Feb 22 2018, 9:08 PM
  • remove default args
sbc100 marked an inline comment as done.Feb 22 2018, 9:09 PM

I fixed the DefinedData constructor args. The rest we can follow up with. Thanks for all the reviewing!

This revision was automatically updated to reflect the committed changes.