This is an archive of the discontinued LLVM Phabricator instance.

[WebAssembly] Initial wasm linker implementation
ClosedPublic

Authored by sbc100 on Jun 29 2017, 5:23 PM.
Tokens
"Yellow Medal" token, awarded by pepyakin.

Details

Summary

This linking backend is a work in progress, but is enough
to link simple programs including linking against library
archives.

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
ruiu added inline comments.Aug 29 2017, 3:48 PM
wasm/Writer.cpp
242

It needs comment -- it is not clear to me what this is for.

286–288

I don't think you need this as you can do this at the beginning of writeTo(). You need a finalizeContents if you don't know the size of a section unless you call that function, but this is not the case. I guess you can eliminate finalizeContents entirely from this patch.

415

Debug code?

419

clang-format

513

Flip the condition and do early return.

885–887

You can remove these three {s.

890

Can you add a comment? I think I do not understand this wasm-specific function.

916

Ditto

916

Location -> Buf

By convention, Buf should be the first parameter.

927

I think you can merge this and the following switch by removing Encoding variable.

943–944

Instead of computing the size of a padding, I think it is better to just write 5-bytes yourself. I'd define write{U,S}LEB128(uint8_t *Buf, {u,}int32_t) which always writes 5 byte to Buf.

958

I think I do not understand this function. Is this to create dynamic relocations? Please write a comment.

1022

What is this for? Is this a .rel.dyn section in wasm?

1042

This needs a comment.

1059

Ditto. Please describe it for those who are not very familiar with wasm.

1124

Looks like writeOutput is used only once, and writeOutput is a one-liner. You can inline it.

1134

Please write a comment so that I can understand what this function is supposed to do.

1137
MemoryPtr = alignTo(Config->GlobalBase, DataAlignment);
1243

What are Memory and Table?

1252

What is Elem?

1335–1338

"Num" is a more popular prefix than "Total" to manage a number of occurrences of something.

krytarowski added inline comments.
wasm/Error.cpp
21 ↗(On Diff #113119)

Can we use:

#ifdef HAVE_UNISTD_H
#include <unistd.h>
#endif
sbc100 added a reviewer: ruiu.Aug 29 2017, 5:09 PM
sbc100 updated this revision to Diff 113607.Sep 1 2017, 2:51 PM
sbc100 marked 45 inline comments as done.
  • Merge remote-tracking branch 'origin/master' into wasm
  • [WebAssembly] Update relocation names
  • [WebAssembly] Feedback from code review
  • [WebAssembly] clang-format the patch
  • [WebAssembly] Feedback from code review
  • [WebAssembly] Factor out write utils
  • [WebAssembly] For codereview feedback
sbc100 added inline comments.Sep 1 2017, 2:51 PM
tools/lld/CMakeLists.txt
21

I'll just remove the link for now. The linker can be/is used via flavor=wasm.

wasm/Driver.cpp
211

I ran clang-format over the whole patch to fix this and the many other places like it.

wasm/Error.cpp
21 ↗(On Diff #113119)

I don't think lld currently includes llvm's config.h. This is the exact same code in COFF/Error.cpp and ELF/Error.cpp so I think changing this can/should probably be done in a separate change.

wasm/InputFiles.cpp
143–144

I'm not sure I understand exactly what you mean. If I make WasmObj a regular pointer won't I have to manually delete it in the destructor? Why is that better? (Sorry C++11+ is kinda new to me).

FWIW, this is exacly how the COFF linker handles this (See COFF/InputFiles.cpp:92)

wasm/SymbolTable.cpp
40

Yes. In fact all lazy symbols are undefined. I believe this is try to ELF and COFF linkers too (i.e. LazyKind > LastDefinedKind)

85

Interesting, my intuition here is to check for error cases an early out.. so the that fall through at the end is the success path. But I guess this makes sense.

86–91

Also added a test for this.

wasm/Symbols.h
31

Copied from lld/COFF/Symbols.h.

49

Ok, I changed it so that lazy symbols are not undefined. This simplified the code a few places too. I had been assuming that all symbols were either defined or undefined, but it seems that doesn't need to be that case.

wasm/Writer.cpp
103

All these functions are for creating the synthetic sections, which are created in memory prior to mmaping the output file (since their size needs to be known).

103

I did some refactoring and put all these writer utility functions in a separate file.

173

Yes. Only custom sections have names. All the known sections are identified by their integer code.

186

Basically I'm printing extra information here in order to override the default behaviour of write_uleb128. Maybe the debug output a little more useful, but maybe over complicated?

239

For synthetic sections we need to fully serialize them in order to know that size of the final output file. i.e. we need to write them out before we can mmap the file.

This is only true for the synthetic sections.

286–288

Actually we don't know the size of the section until we synthesize its header, however you are right we can just this in the constructor and avoid implementing this method at all.

916

Renamed for clarity

943–944

I agree. But encodeULEB128 is defined in include/llvm/Support/LEB128.h and it takes a padding count. It should really take a total byte count to avoid having to do this calculation. But that should probably be a separate change. I work on that in parallel.

1022

This is for '-r' since the output object needs to contain all the relocs in that case (I assume this is the same for ELF/COFF)?

1243

Memory defines the linear memory required my the wasm module. It is essentially just and initial and max size.

Tables currently only used for indirect function calls. All indirect function calls are done via that table. Taking the address of a function in C causes that function to be added to this table. This is because functions don't themselves live in addressable memory.

1252

These are the element of the table (the indirect function call table).

ruiu added inline comments.Sep 6 2017, 1:54 PM
ELF/Error.cpp
19–22 ↗(On Diff #113607)

What is this change?

tools/lld/lld.cpp
37

I don't think lld -flavor <flavor> is a preferred way of dispatching, and I'd avoid using it in wasm. If you use the option, and someone implements a different linker for wasm, it still has to claim it is lld and support -flavor option. How about ld-wasm? If, for example, bfd wants to support wasm, they could name their linker as ld-wasm.bfd.

wasm/Driver.cpp
113–116

By convention, I'd rename Result -> Res and SymbolName -> Sym.

142–143

Remove a comment about sysroot.

207

I don't think this function has to be a member of LinkerDriver. You can just move the code from here to Archive::addMember, no?

291–295

This is probably a bit simpler:

// Parse -allow-undefined-file <filename>
if (auto *Arg = Args.getLastArg(OPT_allow_undefined_file))
  if (Optional<MemoryBufferRef> Buf = readFile())
    for (StringRef S : getLines(*Buf))
      Config->AllowUndefinedSymbols.insert(S);
297–304

Use error instead of fatal as fatal is a noreturn function but we want to usually do not exit on the first error.

A rule of thumb is to use fatal only for corrupted file. For other errors, use error.

wasm/InputFiles.cpp
72

I wouldn't assert(Sym) because Symtab->find(Name)->isDefined() should immediately fail with SEGV if find returns a nullptr.

82

Ditto

98–99

We avoid hash table lookup as much as possible because it's too slow. Instead of storing symbol names to vectors, we usually store Symbol * or SymbolBody * to vectors, so that we don't need to lookup a hash table. Can you do that?

Basically, we don't want to lookup a hash table more than once for each string instance. If you do more than once, then you probably should keep the hash table lookup result (i.e. Symbol * or SymbolBody *) so that it doesn't have to repeat the same hash table lookup. I think this is one of the reasons why lld is much faster than other linkers.

99–101

It feels to me that the code has probably a bit too much asserts. Can you simply return Symtab->find(Name)->getOutputIndex()?

wasm/InputFiles.h
108

Could you add blank lines before the block comments?

wasm/SymbolTable.h
23–25

This is not used in the header.

wasm/Symbols.h
58

Isn't function global?

wasm/Writer.cpp
100

It is still using the streaming output interface, but is there a reason to prefer that? Since we are using mmap-based or std::vector-backed buffer exclusively in lld in general, I don't like to introduce a new way of writing to a buffer unless it is absolutely necessary.

102

Do you have to use formatv? It seems you can do "section type [" + sectionTypeToString(Type) + "]".

256

Outline this constructor and writeTo function.

497

Calculates

sbc100 updated this revision to Diff 114286.Sep 7 2017, 4:57 PM
sbc100 marked 15 inline comments as done.
  • Merge branch 'master' into wasm
  • [WebAssembly] Update expectations now that SymbolInfo not a required YAML field.
  • [WebAssembly] Don't try to export main if its undefined
  • [WebAssembly] Review feedback
wasm/Driver.cpp
291–295

I think I originally make the argument handling and the file processing seperate since I imagined a future where these files might grow in number and/or complexity (i.e. a set of default files in addition to the command line provided one). But I guess we can add that complexity where we get there.

wasm/Symbols.h
58

Wasm symbols are modeled as either either functions, or globals (in the wasm sense). Wasm globals correspond to C/C++ globals and have addresses in linear memory. So a given symbol is either a wasm function or a wasm global.

wasm/Writer.cpp
100

For writing these synthetic sections that contains a lot of LEB encoded data I think this stream interface makes a lot of sense. The LEB encoding functions can write directly to them and they grow on demand. Without this we would need to invent a new kind of auto-growing buffer that does basically the same thing. We can't know how big the vector/buffer is up front because we don't know the length of the LEBs until we encode them.

sbc100 updated this revision to Diff 115885.Sep 19 2017, 12:42 PM
  • [WebAssembly] Fix index for R_WEBASSEMBLY_MEMORY_ADDR_* relocations
  • Merge remote-tracking branch 'origin/master' into wasm
  • [WebAssembly] Don't re-export entry point as 'main'
  • [WebAssembly] Allow mulitple wasm data segments
  • [WebAssembly] Update tests to match changes in llvm
  • [WebAssembly] Update LEB to match changes to LEB128.h
  • [WebAssembly] don't test data-layout with -r
  • [WebAssembly] Refactor code to create OutputSections.cpp
ruiu edited edge metadata.Sep 19 2017, 4:30 PM

Sorry for the slow response. This patch is huge and needed concentration. Generally,

test/wasm/archive.ll
2

All these tests need "REQUIRES: wasm" so that they won't fail on buildbots that don't have wasm enabled.

wasm/Driver.cpp
181

I'd use fatal only to report corrupted input files. Since this is just a command-line error, I'd use error instead. This way, we can report more than one error.

229

This is -help. (/help is a Windows option.)

268–271

This is more consistent with other option handlers.

if (auto *Arg = Args.getLastArg(OPT_allow_undefined_file))
  if (Optional<MemoryBufferRef> Buf = readFile(A->getValue()))
    ...

Also this code catches error when you pass -allow-undefined-filename ''.

275–279

Please use error().

278

hasArgNoClaim is the same as hasArg in lld's usage of the class, so I"d use hasArg.

wasm/Driver.h
1 ↗(On Diff #115885)

It seems everything declared in this header is actually being used only by Driver.cpp, so can you move them to Driver.cpp?

wasm/InputFiles.cpp
14

Driver.h is unused.

167–172

Do you think you can just insert symbols to the symbol table instead of temporarily storing names to these FunctionImports or GlobalImports variables? I'd just add an Undefined symbol (or something more appropriate) here and add its return value (a Symbol *) to FunctionSymbols and GlobalSymbols, so that I can remove FunctionImports and GlobalImports functions as well as getFunctionSymbol and getGlobalSymbol accessors.

184

Inlining this seems more straightforward since the function is short and used only once:

Symbols.push_back(Symtab->addUndefined(this, &Sym));
188
Symbols.push_back(Symtab->addDefined(this, &Sym));
209

Here's a question: do you have to use llvm-ar to create archive files for wasm files? I guess the answer is yes because regular ar commands don't understand wasm files so they cannot create symbol tables for them.

We have the same issue for LTO, because for LTO builds, input files are LLVM bitcode files which regular ar commands don't know about. We found that the problem is very annoying and using the right ar command is sometimes pretty hard. So we have this hack in ELF.

https://github.com/llvm-project/llvm-project/blob/51124a3e488376d9d3ba846fc251e510d082d10b/lld/ELF/Driver.cpp#L184

You don't want make a change to this patch at the moment, but I think this is something you want to consider.

267

We found that it doesn't make much sense to strip directory name components. I wouldn't call getBasename() on them.

wasm/OutputSections.cpp
30

Can you return StringRef so that it becomes more llvm-ish?

57–58

If the use of fatal is correct (if this is to report a corrupted input file), then please remove return after that, since fatal is a _Noreturn function. If not, please use error.

62

toString functions should be named in lld namespace instead of lld::<arch> namespace, so that it doesn't hide overriden toString functions in lld namespace.

63

"section: " is not part of a name, so I'd remove it. If you need this label, print it out on caller side.

64

This is perhaps a bit easier to read.

std::string S = sectionTypeToString(Section->Type);
if (!Section->Name.empty())
  return S + "(" + Section->Name + ")";
return S;
81

Looks like you can just return the RHS value instead of assigning it to a temporary variable.

98

fatal is a _Noreturn function, so no need to add break after it, as it is unreacahble.

wasm/OutputSections.h
25–26

I don't think you need this constructor because, instead of

OutputDataSegment(Foo, Bar)

you can write like this:

{Foo, Bar}
54
Offset = 0

is a preferred way of initializing a member if it always initialized to 0.

68–71

nit: you could write without reassignments:

memcpy(Buf + Offset, Header.data(), Header.size());
memcpy(Buf + Offset + Header.size(), Body.data(), Body.size());
ruiu added inline comments.Sep 19 2017, 4:30 PM
wasm/OutputSections.cpp
107

Can you use ArrayRef?

148

Is it okay to fallthrough? If so, add LLVM_FALLTHROUGH.

151

Why you can't directly call encodeULEB128(Reloc.Value, Buf, 5); here?

156

Ditto -- it seems you could call encodeSLEB128(Reloc.Value, Buf, 5); here.

165

Previous switch doesn't have a default, so the control could reach here with Encoding uninitialized, no?

166–173

Remove {}

196

I believe you don't need this because raw_string_ostream::~raw_string_ostream would do this for you.

202

If BodySize is always initialized to 0, the preferred way is to declare the variable with the initial value (i.e. BodySize = 0;)

214

remove const as it is obvious that we don't mutate Content.

231–232

You want to merge successive log into one function call because otherwise other thread could write other lines between the two lines.

wasm/OutputSections.h
51

I don't know why you need to aggregate input sections' relocations into one place. For COFF or ELF, we don't do that. Each input section has writeTo method and knows how to copy itself to a given buffer and relocate itself. Can't you do in the same way?

wasm/Symbols.cpp
47

If you know File is of ObjectFile, use cast instead of dyn_cast.

wasm/Symbols.h
95

I geuss you can initialize

ArchiveSymbol = {nullptr, 0, 0};
107–110

In lld we don't use operator overloading unless it is absolutely necessary. Since you can explicitly call toString() before emitting it to an output buffer, you can live without this. I'd remove this function.

wasm/Writer.cpp
36

constexpr instead of const may make more sense.

70

Do you need this?

145

We use pre-increments unless there's reason to use post-increments.

194–196

You can omit {}

207–211

You can omit {}

267

is?

346

Remove an empty function.

376

auto -> auto *

385

auto -> auto *

505

We usually name this Offset.

Is uint32_t enough? I thought 64-bit executables could be larger than 4 GiB.

585

I'm a bit confused about what this function does. Does this calculate file offsets as opposed to virtual memory offsets? In general, this file needs more comments.

608–613

Looks like it's the same as

if (WasmFile->memories().size() > 1)
  fatal(File->getName() + ": contains more than one memory");
628

No else after fatal, as fatal does not return.

765–771

You could write

void wasm::writeResult() { Writer().run(); }

without these namespaces.

wasm/Writer.h
23–27

It seems you are not using this struct.

wasm/WriterUtils.cpp
24

Use the LLVM naming convention.

35–36

fatal is _Noreturn.

40–41

Instead of enclosing the entire file, can you explicitly add wasm:: to each public functions?

wasm/WriterUtils.h
40–44

These two functions look the same.

ruiu added a comment.Sep 19 2017, 4:33 PM

I'm sorry that I accidentally hit "Submit" button while I was writing a comment.

So, generally, this patch is improved in a right direction. There are still some code that I don't understand yet, mainly in Writer, but it is probably due to my lack of knowledge about wasm.

Thank you for all the feedback @ruiu. It not surprise to me that a patch of this size takes a long time to review. I'm currently working on a fairly large change to allow data sections with the same name to be combined into a single section.. in order to allow things like init_array/fini_array. My aim is to get that done before the initial version of this patch lands.

sbc100 updated this revision to Diff 116282.Sep 21 2017, 3:20 PM
sbc100 marked 33 inline comments as done.
  • [WebAssembly] Update test output to match llvm changes
  • [WebAssembly] only run wasm tests when wasm is enabled
  • [WebAssembly] Review feedback
sbc100 added inline comments.Sep 21 2017, 3:20 PM
test/wasm/archive.ll
2

I added lit.local.cfg to this directory to disable them all in this case.

wasm/InputFiles.cpp
167–172

Yes, you are right. The work i'm doing to combine data sections will address, and makes this code much cleaner. So this will be fixed on my next upload.

209

Yes, llvm-ar would be needed to great the archive indexes.

I see yes, it might be good to handle archives with no symbol table. I'll add a TODO note here.

267

Ah, I was copying from the COFF linker. We should probably make this consistent at some point (perhaps as good candidate for shared code?)

wasm/OutputSections.cpp
57–58

switched to llvm_unreachable

196

True, but the getSize() below depends on Header, so we need the value before the end of the function. I could refactor to avoid this, especially since its only a logging statement.

wasm/OutputSections.h
25–26

I'm creating these using emplace_back() which seems to require a constructor. I could just use push_back() instead I guess?

ruiu added a comment.Sep 24 2017, 4:18 PM

Is this ready for another round of review?

wasm/WriterUtils.cpp
39

More conventional way of doing it is to add using namespace lld; at the start of each file and do not enclose a file with namespace lld {}.

sbc100 updated this revision to Diff 120681.Oct 27 2017, 12:59 PM
  • Merge branch 'master' into wasm
  • [WebAssembly] Use UINT32_MAX rather than 0 for adddresses of undefined globals/functions
  • [WebAssembly] Combine data segments with the same name
  • Reflect upstream path changes to include/lld/common (#7)
  • Delete lldConfig library from CMakeFiles.txt
  • Add lldCommon to wasm LINK_LIBS (left over from last update)
  • Merge branch 'master' into wasm
  • Fix a merge conflict
  • Merge remote-tracking branch 'origin/master' into wasm
  • [WebAssembly] Ensure than name section is in function index order
  • [WebAssembly] Fix for SLEB relocations of type UINT32_MAX.
  • Merge remote-tracking branch 'origin/master' into wasm_work
  • [WebAssembly] Switch to Common/ErrorHandler.h
  • [WebAssembly] use Common/Threads.h
ruiu added a comment.Oct 30 2017, 4:06 PM

This patch generally looking good. Looks like all my big concerns have already been resolved.

I have a little concern on documentation. Since wasm is a new object file format, and it employs different concepts than ELF or COFF, we cannot assume any prior knowledge to readers on wasm file format or linking semantics other than the common knowledge about linking. So, in short, I think this patch needs a bit more comment. Particularly, comments from high level would help readers understand your code, because once you understand a concept, minor details of your code will look pretty obvious. (This is a pretty straightforward linker implementation, so once you understand what it is supposed to do, it is easy to understand the code, I think.)

tools/lld/CMakeLists.txt
21

Looks like <arch>-<toolname> is more common than <toolname>-<arch>. For example, cross gcc for AVR installed on my system is avr-gcc and not gcc-avr. So, maybe wasm-ld is a better name than ld-wasm.

Sorry for back-and-forth.

wasm/InputFiles.cpp
136

Please add a brief function comment saying that this function returns a segment that contains a given symbol.

156–159

nit: pre-++ is a bit more C++-ish.

234

count++ → ++Count

245

each other.)

wasm/InputFiles.h
84–86

They are private functions, no?

118

nit: add a blank line.

124–128

Can you add a blank line before each line comment?

wasm/InputSegment.h
8

Can you describe what is "segment" in wasm? I was thinking that segments are another name for output sections, but looks like it is a different concept.

wasm/OutputSections.cpp
34

Return StringRef instead of const char *.

57–58

Looks like this function is not updated.

wasm/Symbols.cpp
36

Can you add function comments to functions defined in this file?

wasm/Writer.cpp
70

Looks like you didn't update this file for my previous comments.

sbc100 updated this revision to Diff 122722.Nov 13 2017, 1:49 PM
sbc100 marked 14 inline comments as done.
  • [WebAssembly] Handle early exit via exitLld
  • Merge remote-tracking branch 'origin/master' into wasm
  • [WebAssembly] Fix --verbose flag
  • [WebAssembly] Fix gcc warnings/errors
  • [WebAssembly] Fix code relocation for large output files
  • Merge remote-tracking branch 'origin/master' into wasm
  • [WebAssembly] Support comments in wasm.syms
  • Merge remote-tracking branch 'origin/master' into wasm
  • [WebAssembly] Update for llvm change
  • [WebAssembly] Review feedback
sbc100 updated this revision to Diff 122735.Nov 13 2017, 3:04 PM
  • add docs
sbc100 updated this revision to Diff 122741.Nov 13 2017, 3:48 PM
  • Remove whole-file namespace wrapping

Hi ruiu.

I've added some documentation, removed the whole-file namesspace wrapping, integrated with the recent libCommon changes and upstream llvm changes.

Hopefully we can land this soon, so we can move to incremental diffs that will be easier to review. Or maybe you think I think try to be more feature complete for the initial version?

ruiu accepted this revision.Nov 16 2017, 4:17 AM

I'd write more documents to describe the wasm object file format and the wasm linker's semantics, but we can do that later.

LGTM with these changes.

Thank you very much for your patience!

docs/WebAssembly.rst
9

mimic

34

functions

wasm/Driver.cpp
83

If you change this to LinkerDriver().link(Args), can you remove global variable Driver?

151

Could you revert this '#' check if you don't need it now? I mean if we want to add a way to have a comment in some file, we want to do the same thing for ELF and wasm, instead of inventing a different way for each target.

This revision is now accepted and ready to land.Nov 16 2017, 4:17 AM
sbc100 updated this revision to Diff 123246.Nov 16 2017, 2:59 PM
sbc100 marked an inline comment as done.
  • Fix use after free
  • Merge remote-tracking branch 'origin/master' into wasm
  • [WebAssembly] Add test case for output >128 functions
  • [WebAssembly] Refactor the way relocations are applied and written
  • [WebAssemlby] Parallel code writing
  • [WebAssembly] Parallel data writing
  • [WebAssembly] clang-format
  • [WebAssembly] IWYU
  • fix typo in addDefinedGlobal
  • Merge branch 'master' into wasm
  • review feedback
aheejin added inline comments.Nov 16 2017, 3:16 PM
wasm/Driver.cpp
103

After this was changed from const to constexpr, VS 2017 compiler on the waterfall bot does not compile this, saying:

tools\lld\wasm\Options.inc(28): error C2131: expression did not evaluate to a constant
tools\lld\wasm\Options.inc(28): note: failure was caused by non-constant arguments or reference to a non-constant symbol

I'm not really sure what the problem is, since this looks fine to me, but anyway could you possibly revert this to const, to make the VS2017 happy?

sbc100 updated this revision to Diff 123249.Nov 16 2017, 3:23 PM
  • fix windows build
sbc100 updated this revision to Diff 123255.Nov 16 2017, 3:30 PM
  • final tweaks (i hope!)

No problem! Glad to land this initial version. Lets hope it sticks ..

docs/WebAssembly.rst
34

Reworded.

wasm/Driver.cpp
103

Ok, I reverted that for now. The other linker backed all use const rather than `constexpr` anyway.

151

This is already a completely wasm-specific feature. And it quite useful to support comments at the moment.

See: https://github.com/jfbastien/musl/blob/wasm-prototype-1/arch/wasm32/wasm.syms

In the long run I think we might be able to find a better solution to this. Some way to inform the linker about the expected undefined symbols. Maybe some kind of declspec in the source code?

sbc100 edited the summary of this revision. (Show Details)Nov 16 2017, 3:30 PM
This revision was automatically updated to reflect the committed changes.