Page MenuHomePhabricator

[WebAssembly] Implement thread-local storage (local-exec model)
ClosedPublic

Authored by quantum on Jul 10 2019, 3:15 PM.

Details

Summary

Thread local variables are placed inside a .tdata segment. Their symbols are
offsets from the start of the segment. The address of a thread local variable
is computed as __tls_base + the offset from the start of the segment.

.tdata segment is a passive segment and memory.init is used once per thread
to initialize the thread local storage.

__tls_base is a wasm global. Since each thread has its own wasm instance,
it is effectively thread local. Currently, __tls_base must be initialized
at thread startup, and so cannot be used with dynamic libraries.

__tls_base is to be initialized with a new linker-synthesized function,
__wasm_init_tls, which takes as an argument a block of memory to use as the
storage for thread locals. It then initializes the block of memory and sets
__tls_base. As __wasm_init_tls will handle the memory initialization,
the memory does not have to be zeroed.

To help allocating memory for thread-local storage, a new compiler intrinsic
is introduced: __builtin_wasm_tls_size(). This instrinsic function returns
the size of the thread-local storage for the current function.

The expected usage is to run something like the following upon thread startup:

__wasm_init_tls(malloc(__builtin_wasm_tls_size()));

Diff Detail

Repository
rL LLVM

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Apply review feedback

I wonder if __tls_base should be allocated by the embedder (or by the parent/creator thread). Then it could be an *immutable* global import that is allocated up front. I guess __stack_pointer should be treated in the same way (except mutable).

I don't want to block this change on this, but just putting the idea out there.

lld/test/wasm/data-layout.ll
43 ↗(On Diff #209274)

Is there any reason these symbols need to exist in non-threaded builds. The cost of two extra globals is small but not nothing.

I'd also rather not skip over them here. If TLS changes I'd like to see this test need updating.

lld/test/wasm/data-segment-merging.ll
32 ↗(On Diff #209391)

Again, can we avoid creating this completely for non-threaded builds?

lld/test/wasm/tls.ll
57 ↗(On Diff #209102)

This is known issue with the disassembler in that it doesn't know where the functions start and stop in executable output, only in object files.

lld/wasm/Writer.cpp
629 ↗(On Diff #209391)

Maybe write this as a single conditional so its clear even without this comment?

Its really great to see this change BTW! Thanks.

quantum updated this revision to Diff 209535.Jul 12 2019, 11:03 AM

Guard __wasm_init_tls and TLS globals behind a flag (currently --shared-memory).

@tlively and I discussed offline and agreed that it's probably best to use the existing flag instead of adding another flag that needs to be passed when linking multi-threaded executables.

quantum marked 4 inline comments as done.Jul 12 2019, 11:04 AM
quantum added inline comments.
lld/wasm/Writer.cpp
629 ↗(On Diff #209391)

Changed in latest diff.

quantum updated this revision to Diff 209560.Jul 12 2019, 12:18 PM
quantum marked an inline comment as done.

Switch to using ISelDAGToDAG instead of ISelLowering

quantum updated this revision to Diff 209566.Jul 12 2019, 12:36 PM

Don't show a crash report for incorrect flags to use TLS

@sbc100 As for __tls_base being immutable, I am not sure how this will interact with dynamic linking once we implement it. On native platforms, dlopened libraries have their TLS blocks allocated on first use.

What exactly happens to dlopened libraries when there are already threads around? Would it be feasible to create TLS blocks for when the library is loaded on existing thread?

If there's any chance this TLS ABI could be useful for WASI (I don't know if there's been any WASI work on threads yet, but it seems like there's no reason it couldn't be), then we should start a doc in tool-conventions for it. If not then we should get it behind the emscripten OS in LLVM. (and document it anyway; either in tool-conventions or somewhere in the emscripten site).

Oh, btw, any reason these have to be passive segments? Why can't we just make them active segments and let the VM initialize them for us?

Oh, btw, any reason these have to be passive segments? Why can't we just make them active segments and let the VM initialize them for us?

These need to be passive segments so that we can initialize it in different locations in linear memory for every thread.

If we had multiple memories, TLS could have been implemented as a separate linear memory, but we are stuck with mallocing and initializing for now.

If there's any chance this TLS ABI could be useful for WASI (I don't know if there's been any WASI work on threads yet, but it seems like there's no reason it couldn't be), then we should start a doc in tool-conventions for it. If not then we should get it behind the emscripten OS in LLVM. (and document it anyway; either in tool-conventions or somewhere in the emscripten site).

Documenting this in tool-conventions is planned.

quantum updated this revision to Diff 209656.Jul 12 2019, 5:42 PM

Use signed LEB128 for i32.const. Basically, do what D64612 did.

aheejin added inline comments.Jul 12 2019, 5:51 PM
lld/test/wasm/data-segments.ll
7 ↗(On Diff #209566)

What is the difference between ACTIVE and ACTIVE-TLS? It looks we don't have different build processes for them. And as what @sbc100 said, can we exclude TLS from build?

lld/test/wasm/tls.ll
6 ↗(On Diff #209566)

Can we possibly mix one non-tls global variable between tls1 and tls2, just to see they work together?

lld/wasm/Writer.cpp
431 ↗(On Diff #209102)

Ah you're right.

514 ↗(On Diff #209102)

PIng. It doesn't look like fixed yet

629 ↗(On Diff #209391)

Has it been reflected? It says it has been changed but it doesn't look like it has

llvm/lib/Target/WebAssembly/WebAssemblyISelDAGToDAG.cpp
203 ↗(On Diff #209566)

These two lines can be shortened to

ReplaceNode(Node, TLSAddress);

The same applies to the code below for __tls_size too.

208 ↗(On Diff #209566)

Nit: Might be a bit cleaner to extract the expression, which is complicated, like

unsigned IntNo = cast<ConstantSDNode>(Node->getOperand(0))->getZExtValue());
switch (IntNo) {
 ...
quantum updated this revision to Diff 209658.Jul 12 2019, 6:07 PM
quantum marked 14 inline comments as done.

Change per review feedback.

lld/test/wasm/data-segments.ll
7 ↗(On Diff #209566)

ACTIVE-TLS is for builds with TLS enabled. Currently, we use --shared-memory to determine that, per @tlively's recommendation. The rationale is that we don't want even more flags that need to be passed in a proper threaded build.

lld/test/wasm/tls.ll
6 ↗(On Diff #209566)

Done.

lld/wasm/Writer.cpp
514 ↗(On Diff #209102)

Looks like the fix got rebasing against the change that made everything lowerCamelCase. Going to do it again.

629 ↗(On Diff #209391)

I think you are commenting on an old version? It is a single condition on the latest version.

llvm/lib/Target/WebAssembly/WebAssemblyISelDAGToDAG.cpp
203 ↗(On Diff #209566)

Nice. I didn't know that was possible. Changing.

208 ↗(On Diff #209566)

Done.

aheejin added inline comments.Jul 12 2019, 6:54 PM
lld/test/wasm/data-segments.ll
7 ↗(On Diff #209566)

Then if we don't enable --shared-memory, we don't generate those globals? Do we have a test for that?

tlively added inline comments.Jul 12 2019, 8:30 PM
lld/wasm/Driver.cpp
543 ↗(On Diff #209102)

Until we do implement TLS for shared modules, would it make sense to omit these globals and function from shared modules, since we can't use them in that context anyway?

lld/wasm/Writer.cpp
638 ↗(On Diff #209274)

It would be really great if we could find a way to elide the .bss 0 bytes as a code size optimization. Since we can't assume that the memory is already zeroed the way we can with passive segments, perhaps we can use a memory.fill instruction to zero the memory? Pursuing this in a follow-on CL should be fine.

805 ↗(On Diff #209658)

You could avoid duplicating these lines by making them unconditional.

905 ↗(On Diff #209658)

Can you remind me how the InitTLSFunction interacts with relocatable code? I'm wondering if this should be called up in the condition with the other synthetic functions.

llvm/lib/Target/WebAssembly/WebAssemblyTargetMachine.cpp
193 ↗(On Diff #209658)

I just realized that if we have atomics but not bulk memory and TLS is stripped, then we will be in the awkward situation of both using atomics and disallowing atomics because the module is not thread safe. I think the best solution would be to go back and forcibly strip both atomics and TLS if either of them would be stripped.

llvm/test/CodeGen/WebAssembly/tls.ll
2 ↗(On Diff #209658)

It would be good to check the negative case, too, i.e with bulk-memory disabled.

6 ↗(On Diff #209658)

Is CHECK still used as a prefix if it not listed in the invocation of FileCheck?

quantum updated this revision to Diff 209953.Jul 15 2019, 1:57 PM
quantum marked 16 inline comments as done.

Deal with review comments.

lld/test/wasm/data-segments.ll
7 ↗(On Diff #209566)

Yes. In this file, the cases with --check-prefix ACTIVE will ensure that the fields are not generated.

lld/wasm/Driver.cpp
543 ↗(On Diff #209102)

That would make sense for now.

lld/wasm/Writer.cpp
638 ↗(On Diff #209274)

Yes, this is a good idea for a follow-on CL.

805 ↗(On Diff #209658)

Right.

905 ↗(On Diff #209658)

I don't think it should. __wasm_init_tls initializes the TLS block for the current module only, so every shared library needs to have its own __wasm_init_tls.

llvm/lib/Target/WebAssembly/WebAssemblyTargetMachine.cpp
193 ↗(On Diff #209658)

Done.

llvm/test/CodeGen/WebAssembly/tls.ll
2 ↗(On Diff #209658)

Done.

6 ↗(On Diff #209658)

Yes, the default prefix is CHECK.

quantum updated this revision to Diff 209966.Jul 15 2019, 3:07 PM

Fix duplicate end function

The offset field of a segment can be a constant expression which can be a global.get of an imported global. So we could have an imported global __tls_base which is different for each thread, and have an active segment with that as its segment offset?

Another high-level question (based just on reading the CL description): The TLS-size intrinsic is per-function, does that mean that the tls-init function is called for every function? are there just multiple TLS sections per object file?

Another high-level question (based just on reading the CL description): The TLS-size intrinsic is per-function, does that mean that the tls-init function is called for every function? are there just multiple TLS sections per object file?

The TLS-size intrinsic returns the total size of TLS for the module it's called from. The initialization function initializes the TLS for the entire module.

The offset field of a segment can be a constant expression which can be a global.get of an imported global. So we could have an imported global __tls_base which is different for each thread, and have an active segment with that as its segment offset?

I didn't know that it could have been a constant expression. I don't think this would have worked very well on the main thread though, since we need to run malloc before we can compute __tls_base. I think this requires the global to be mutable, if only because we need to be able to initialize it on the main thread.

Where should we call __wasm_init_tls, in case within a library?

Where should we call __wasm_init_tls, in case within a library?

Dynamic libraries are not supported by the local-exec TLS model. Support will be implemented later. The idea is that we'll have an __ensure_tls_initialized function that will malloc and __wasm_init_tls if __tls_base == 0.

LGTM apart from one last comment

llvm/lib/Target/WebAssembly/WebAssemblyTargetMachine.cpp
193 ↗(On Diff #209658)

We can be more fine grained than this. We should strip atomics or TLS only if the relevant feature is missing or the other one is stripped. It is possible to have thread-locals and bulk memory but not have any atomics to strip.

quantum marked an inline comment as done.Jul 16 2019, 1:45 PM
quantum updated this revision to Diff 210166.Jul 16 2019, 1:45 PM

More fine-grainted stripping.

sbc100 accepted this revision.Jul 16 2019, 1:56 PM
sbc100 added inline comments.
lld/wasm/Writer.cpp
400 ↗(On Diff #210166)

Right now we always compiler with -fdata-sections, but if we didn't then object files might have just one ".tdata" sections. Perhaps drop the final "."?

This revision is now accepted and ready to land.Jul 16 2019, 1:56 PM
quantum updated this revision to Diff 210184.EditedJul 16 2019, 2:35 PM

Removed the trailing .. I would add a test to make sure things work without -fdata-sections, but this->Options.DataSections = true; is hard-coded in WebAssemblyTargetMachine.cpp.

quantum marked an inline comment as done.Jul 16 2019, 2:35 PM
quantum updated this revision to Diff 210187.Jul 16 2019, 2:53 PM

Remove extra braces

quantum updated this revision to Diff 210188.Jul 16 2019, 2:57 PM

Disable atomics when TLS is stripped

tlively accepted this revision.Jul 16 2019, 2:57 PM

Nice work!

This revision was automatically updated to reflect the committed changes.

I had a reply that got eaten here, so I'm going to keep trolling you on your CL since we don't have a design doc for this.
The offset field of a data segment initializer can be a global.get on an imported global. (https://webassembly.github.io/spec/core/valid/instructions.html#constant-expressions). Since each thread is separately instantiated with separate JS, we could have a global import like __tls_base which has a different value in each thread. Then we wouldn't need to manually call the init code anywhere. Would there be other advantages or disadvantages for that?

I had a reply that got eaten here, so I'm going to keep trolling you on your CL since we don't have a design doc for this.
The offset field of a data segment initializer can be a global.get on an imported global. (https://webassembly.github.io/spec/core/valid/instructions.html#constant-expressions). Since each thread is separately instantiated with separate JS, we could have a global import like __tls_base which has a different value in each thread. Then we wouldn't need to manually call the init code anywhere. Would there be other advantages or disadvantages for that?

I already answered this one:

The offset field of a segment can be a constant expression which can be a global.get of an imported global. So we could have an imported global __tls_base which is different for each thread, and have an active segment with that as its segment offset?

I didn't know that it could have been a constant expression. I don't think this would have worked very well on the main thread though, since we need to run malloc before we can compute __tls_base. I think this requires the global to be mutable, if only because we need to be able to initialize it on the main thread.

The problem I found with the import approach is that the main thread cannot use global.get of an imported global to get the location of TLS, mainly because it needs to be malloc'd. To use malloc, the main thread needs to be initialized. This will result in a circular dependency.

Another high-level question (based just on reading the CL description): The TLS-size intrinsic is per-function, does that mean that the tls-init function is called for every function? are there just multiple TLS sections per object file?

The TLS-size intrinsic returns the total size of TLS for the module it's called from. The initialization function initializes the TLS for the entire module.

The offset field of a segment can be a constant expression which can be a global.get of an imported global. So we could have an imported global __tls_base which is different for each thread, and have an active segment with that as its segment offset?

I didn't know that it could have been a constant expression. I don't think this would have worked very well on the main thread though, since we need to run malloc before we can compute __tls_base. I think this requires the global to be mutable, if only because we need to be able to initialize it on the main thread.

Interesting. I was assuming we could be clever and make the main thread special and have the linker allocate its tls space upfront (like we do for main thread stack space) rather than do it a runtime, but maybe that needlessly complicated. Seems like ideally we could want to import __tls_base but not for the main thread but I don't think of a way to express that.