[WebAssembly] Use Symbol class heirarchy. NFC.
ClosedPublic

Authored by sbc100 on Thu, Feb 8, 8:34 PM.

Details

Summary

This brings wasm into line with ELF and COFF in terms of
type symbol types are represented.

Diff Detail

Repository
rLLD LLVM Linker
sbc100 created this revision.Thu, Feb 8, 8:34 PM
sbc100 retitled this revision from [WebAssembly] Use Symbol class heirarchy. NFC. to [WIP] [WebAssembly] Use Symbol class heirarchy. NFC..Thu, Feb 8, 8:35 PM
sbc100 updated this revision to Diff 133557.Thu, Feb 8, 8:39 PM

remove line

dberris accepted this revision.Thu, Feb 8, 9:19 PM
dberris added a subscriber: dberris.

LGTM -- nothing obviously weird from my skim.

wasm/InputFiles.cpp
56

Do you need the else here?

This revision is now accepted and ready to land.Thu, Feb 8, 9:19 PM
sbc100 retitled this revision from [WIP] [WebAssembly] Use Symbol class heirarchy. NFC. to [WebAssembly] Use Symbol class heirarchy. NFC..Fri, Feb 9, 11:27 AM
sbc100 added reviewers: ncw, ruiu.

This change has some pretty serious merge conflict with the symbol change change that is in the pipeline so I won't land it without clear plan (Currently looking the resolving the merge myself, but might wait until the larger change lands).

wasm/InputFiles.cpp
56

I don't need it no, I know I know its not the lld style.. its just read better for me. I'll remove though.

ruiu added a comment.Fri, Feb 9, 12:06 PM

Overall looking very nice. It closes the gap between wasm and other lld ports and make it easier to read.

wasm/InputFiles.cpp
307–318

Ideally it should return FunctionSymbol *.

wasm/InputFiles.h
111

Can you make this type safe? I.e. can you change the type of FunctionSymbols so that you don't need to use cast?

wasm/SymbolTable.cpp
148–149

It is not clear to me why we need addDefined function even though we have addDefinedFunction and addDefinedGlobal functions. What is this for?

198

Ditto -- having addUndefinedFunction and addUndefined doesn't seem orthogonal because the former seems like a subset of the latter.

wasm/Symbols.h
88

This class hierarchy is interesting

Symbol
  FunctionSymbol
    DefinedFunction
    UndefinedFunction
  GlobalSymbol
    DefinedGlobal
    UndefinedGlobal
  Lazy

because in ELF and COFF, defined/undefined is a broader category than function/non-function. But I can see that your class hierarchy just represents how symbols are organized in wasm. (That's one of the reasons why I think that the current design of lld, in which all ports share only the design but not a concrete implementation. If we had to implement coff/elf/wasm/etc on all "unified" code base, that would have been extremely hard to do.)

125

Can you move UndefinedFunction here so that related classes are adjacent in source code?

sbc100 added inline comments.Fri, Feb 9, 1:52 PM
wasm/Symbols.h
88

Indeed. It is quite major difference from ELF here. It only occurred to be when doing this work that ELF doesn't have different undefined types. Whereas in wasm an undefined symbol has a lot of type information (for example undefined functions carry their type signature).

We are a little constrained by your typical OO class hierarchy here because, for example, under this scheme there is no common base class for defined or undefined symbols, not that its a huge problem. I think this scheme is still the best options.

sbc100 added inline comments.Fri, Feb 9, 4:23 PM
wasm/InputFiles.h
111

Hmm... i just tried it and realized why I didn't do it like this in the first place.

The problem is that FunctionSymbols is constructed as we parse the object, and before all symbols have been resolved, and at that time archive symbols are Lazy... and will eventually transform in the DefinedFunction symbols or DefinedGlobal symbols. And Lazy symbol can't be cast to either of those types.

I could perhaps work around this by delaying the construction of this vector, or perhaps by introducing a LazyGlobal and LazyFunction subtype?

ncw added inline comments.Tue, Feb 13, 2:23 AM
wasm/Symbols.h
250–257

Yikes, this is scary!

  1. For one, doing make<SymbolUnion> means that the actual symbol's destructor won't be called at the program end. We're relying on the Symbol classes being "simple" with trivial destructors, but what if someone forgets and sticks a std::vector in there as a member...? It's not ideal, quite fragile.
  2. And similarly, replaceSymbol doesn't deallocate the previous data in the union, in just writes straight over it. The new object will be OK, but the previous one will leak any members. Again not a problem as long as the Symbols all have trivial dtors, but it feels like an accident waiting to happen.
  3. Is it technically Undefined Behaviour? You seem to be relying on the exact details of the base-to-derived pointer adjustments. First we allocate the union, then reinterpret-cast it to a Symbol*, then placement-construct an UndefinedFunction symbol. Then the first bit of UB happens, we assume that the newly-constructed UndefinedFunction, when cast to a Symbol, has the same address. That is, we assume the following:

(void*)(BaseClass*)(new (addr) DerivedClass) == (void*)addr

Then the second bit of UB happens when we do replaceSymbol and make the same assumption again, but in replaceSymbol we further assume that all derived classes will construct their Symbol base at the same offset within the memory block we're placement-constructing them in (it's just another assumption about object layout).

The basic assumption is that base classes are constructed before the derived class, and the base class is at offset zero within the memory when placement-constructed at a specific address.

I can see it's just copying the existing LLD code. What you're trying to achieve is to dynamically change the derived type of an object, so that previously-created pointers to the base class remain valid as pointers to the new object's base class.

There is a solution I can think of that's "safe". Instead of doing reinterpret_cast<Symbol*>(make<SymbolUnion>()), why not instead put the Kind member next to the union, something like this: struct SymbolHandle { int Kind; SymbolUnion U; } and then store pointers to the union-with-kind. You can then safely destruct the union using a switch, and safely replace the union too; and finally, you can add automatic casting operators that allow casting a SymbolHandle to an UndefinedFunction& etc with an assertion on Kind and a cast on the appropriate member of the union.

sbc100 added inline comments.Tue, Feb 13, 8:49 AM
wasm/Symbols.h
250–257

I think perhaps you raise some good points, but I see this as more of an lld wide discussion. The other ports have always working in this way AFAICT, and it was always my intention to have the wasm port do the same thing.

Perhaps @ruiu can add some documentation about this technique, why it is actually safe in the this context, and what the motivating factors are for using it.

In any case I don't think we should block this change in this design discussion. Lets make the linkers consistent and iterate (together) from there.

ruiu added a comment.Tue, Feb 13, 9:21 AM

As to the use of the placement new, it is an intentional design choice that symbols are trivially destructible. We shouldn't add anything that needs a desturctor to Symbol for performance reasons. We sometimes allocate literally millions of symbols, and we don't want to call destructors million times. Also, even if we had a destructor for Symbol, they wouldn't be called because we usually use _exit to terminate the process as quickly as possible, without calling the destructor of objects that are used throughout the linking process. lld may not look like very "object-oriented", and that's intentional.

As to the union type, we don't really use it as a union. That is a convenient way to allocate a memory that is large enough to hold any Symbol-derived type, but we don't access any member of it. As you wrote, maybe we should make it explicit by changing (Symbol *)make<SymbolUnion>() to reinterpret_cast<Symbol *>(make<SymbolUnion>()) (I actually wrote that cast with reinterpret_cast in mind, without noticing that there's other way of interpreting it), or if the existence of the union causes confusion like that, we should remove SymbolUnion and define char[maximum symbol size] for a symbol instead. The point is that the union isn't important. We just want to allocate blank memory for symbols. We identify object type using LLVM's classof mechanism, and I think that works just fine.

sbc100 updated this revision to Diff 134059.Tue, Feb 13, 9:24 AM
  • Add static assert
ruiu added a comment.Tue, Feb 13, 9:28 AM

As to the union type, we don't really use it as a union. That is a convenient way to allocate a memory that is large enough to hold any Symbol-derived type, but we don't access any member of it. As you wrote, maybe we should make it explicit by changing (Symbol *)make<SymbolUnion>() to reinterpret_cast<Symbol *>(make<SymbolUnion>()) (I actually wrote that cast with reinterpret_cast in mind, without noticing that there's other way of interpreting it), or if the existence of the union causes confusion like that, we should remove SymbolUnion and define char[maximum symbol size] for a symbol instead. The point is that the union isn't important. We just want to allocate blank memory for symbols. We identify object type using LLVM's classof mechanism, and I think that works just fine.

On second thought, it is already interpreted as interpret_cast<> and not a UB, no? SymbolUnion is not a union of Symbols, but is a union of char[].

sbc100 updated this revision to Diff 134073.Tue, Feb 13, 10:12 AM
  • Add static assert
  • reinterpret_cast
sbc100 updated this revision to Diff 134088.Tue, Feb 13, 11:10 AM
  • Add static assert
  • reinterpret_cast
  • cleanup
sbc100 added inline comments.Tue, Feb 13, 11:57 AM
wasm/Symbols.h
250–257

FWIW, it looks it should be easy to protect against (1) and (2) by adding:

static_assert(std::is_trivially_destructible<T>(),                             
              "Symbol types must be trivially destructible");

The spec on this explicitly says Storage occupied by trivially destructible objects may be reused without calling the destructor..

That still leaves the concerns you raise in (3) of course.

sbc100 updated this revision to Diff 134093.Tue, Feb 13, 12:04 PM
  • reorder classes
sbc100 marked an inline comment as done.Tue, Feb 13, 12:05 PM
sbc100 added inline comments.
wasm/SymbolTable.cpp
148–149

Renamed to reflect their purpose .. addSyntheticGlobal etc..

ruiu accepted this revision.Tue, Feb 13, 1:10 PM

LGTM

Feel free to commit whenever convenient for you.

ncw added a comment.EditedWed, Feb 14, 4:36 AM

As to the union type, we don't really use it as a union. That is a convenient way to allocate a memory that is large enough to hold any Symbol-derived type, but we don't access any member of it. As you wrote, maybe we should make it explicit by changing (Symbol *)make<SymbolUnion>() to reinterpret_cast<Symbol *>(make<SymbolUnion>()) (I actually wrote that cast with reinterpret_cast in mind, without noticing that there's other way of interpreting it), or if the existence of the union causes confusion like that, we should remove SymbolUnion and define char[maximum symbol size] for a symbol instead. The point is that the union isn't important. We just want to allocate blank memory for symbols. We identify object type using LLVM's classof mechanism, and I think that works just fine.

On second thought, it is already interpreted as interpret_cast<> and not a UB, no? SymbolUnion is not a union of Symbols, but is a union of char[].

I'm happy with the assertions that the types are trivially-destructible, that makes good sense.

I'm still a bit uneasy that there might be UB. You're assuming that the base class is positioned right at the front of the class: you take the address of the union and reinterpret it as a Symbol*. Then, when you actually construct the object using placement new, you're assuming that the base-class pointer is unadjusted:

T *replaceSymbol(Symbol *S, ArgT &&... Arg) {
  T *Derived = new (S) T(std::forward<ArgT>(Arg)...);
  assert(static_cast<Symbol*>(Derived) == S); // This is what you are requiring to be true
  return Derived;
}

And yet: the compiler's (implementation-defined) rules for object layout might not put the base class there at the same address as the start of the memory. (And possibly the compile environment could redefine placement new to use padding, possibly by some kind of sanitizer guards?)

I think that for "standard layout" class, the object layout has to put the base class first in memory (http://en.cppreference.com/w/cpp/language/data_members). Unfortunately, the Symbol classes are not "standard layout", so I can't see any standards-provided guarantee that the base class is where you want it to be.

I don't want to be a pesky language lawyer. Would it be OK to add the assertion above?

wasm/Symbols.h
250–257

"have always worked this way" - I think it's actually recent, looks like the change dates to 31 Oct 2017.

Agreed that it's OK to change Wasm to match.

In terms of placement new, the implementation can only every assume sizeof(T) storage. Its documented that users of placement new are supposed to allocate sizeof(T) bytes and pass it in: http://en.cppreference.com/w/cpp/language/new.

I think added the extra assert seems reasonable. I'm not familiar with object layout rules. I think that can happen as a followup and we can add it to all three implementations.

wasm/Symbols.h
250–257

I guess so. I was thinking that they have always done similar tricks involving in-place replacement of the symbols with sub-types of symbol. But I don't know if the old behaviour was more or less risky.

Also, doesn't the existing assert do exactly what you want already:

assert(static_cast<Symbol *>(static_cast<T *>(nullptr)) == nullptr &&          
         "Not a Symbol");

?

Closed by commit rLLD325150: [WebAssembly] Use a Symbol class heirarchy. NFC. (authored by sbc, committed by ). · Explain Why
This revision was automatically updated to reflect the committed changes.
This revision was automatically updated to reflect the committed changes.
ncw added a comment.Thu, Feb 15, 12:41 AM

Also, doesn't the existing assert do exactly what you want already:

assert(static_cast<Symbol *>(static_cast<T *>(nullptr)) == nullptr &&          
         "Not a Symbol");

?

D'oh! I hadn't spotted that one in amongst the rest, it covers exactly one of the cases I was worried about.

OK, I think the existing asserts cover it all then, plus your helpful comment about placement new clears that up.

Whew, thanks for adding the other assertions. I agree now it's all safe. (Although, a standards-conforming compiler could quite legally fail the assertion, no compilers will in practice and you'd just get a build failure. So that's OK.)