This is an archive of the discontinued LLVM Phabricator instance.

[LLD] Add a new PE/COFF port
Needs ReviewPublic

Authored by ruiu on May 26 2015, 12:10 PM.
This revision needs review, but there are no reviewers specified.

Details

Reviewers
None
Summary

This is an initial patch for a section-based COFF linker.

The patch is 2300 lines including comments and blank lines.
Before diving into details, you may want to start from reading
README because it should give you an overview of the design.

All important things are written in the README file, so I write
summary here.

  • The linker is already able to self-link on Windows.
  • It's significantly faster than the existing implementation. The existing one takes 5 seconds to link LLD on my machine, while the new one only takes 1.2 seconds, even though the new one is not multi-threaded yet. (And a proof-of-concept multi- threaded version was able to link it in 0.5 seconds.)
  • It uses much less memory (250MB vs. 2GB virtual memory space to self-host).
  • IMHO the new code is much simpler and easier to read than the existing PE/COFF port.

Diff Detail

Event Timeline

ruiu retitled this revision from to [LLD] Add a new PE/COFF port.May 26 2015, 12:10 PM
ruiu updated this object.
ruiu updated this revision to Diff 26524.May 26 2015, 12:10 PM
ruiu edited the test plan for this revision. (Show Details)
ruiu added a subscriber: Unknown Object (MLST).

Some first round comments.

We should open archive files if you have symbols marked IMAGE_WEAK_EXTERN_SEARCH_LIBRARY.

COFF/InputFiles.cpp
32

Can we chose a better name? This overloads with basename(3).

39

Can we not make this a member function?

204–206

We should have an isAbsolute() and it should use COFF::IMAGE_SYM_ABSOLUTE.

207

Should we follow this path if we see a IMAGE_WEAK_EXTERN_SEARCH_ALIAS symbol?

COFF/README.md
7

s/the LLD/LLD

10–11

Perhaps we should prefer "instead" over "on the other hand" in this instance.

32–35

I think ArchiveSymbol might be a better fit than CanBeDefined.

83

belongs

COFF/Symbols.cpp
36–42

We should probably verify that MS LINK is OK with multiple definitions of a symbol where one of them is COMMON. For example, consider a non-common DefinedRegular symbol whose size is smaller than a common DefinedRegular symbol.

COFF/Symbols.h
49

Perhaps DefinedImportThunkKind

94

If you are going to use comparisons here, please do it like LLVM does like CmpInst::FIRST_ICMP_PREDICATE and CmpInst::LAST_ICMP_PREDICATE by adding additional enumerators.

COFF/Writer.cpp
55–56

This needs a comment.

78

You could do S.split('$').first

ruiu updated this revision to Diff 26534.May 26 2015, 2:02 PM

Address review comments.

emaste added a subscriber: emaste.May 26 2015, 2:21 PM

Added folks who may be interested by this patch to the CC list.

COFF/InputFiles.cpp
32

Renamed getBasename.

39

This function uses ParentName and getName() defined in InputFile class, so maybe it's better to keep it as a member function.

204–206

Done.

207

Added a TODO.

COFF/README.md
32–35

As we discussed offline, "Archive" would conflict with libObject's Archive class, so let's rename "Lazy".

83

Fixed.

COFF/Symbols.cpp
36–42

Added a TODO.

COFF/Symbols.h
49

Done.

94

Done.

COFF/Writer.cpp
55–56

Done.

78

Done.

JFYI, I have plans to carefully review this but unfortunately I wouldn't be able to for the next couple of days. In case this lands without, I'll do a post-commit review.

pcc added a comment.May 26 2015, 6:46 PM

Clang build fixes:

COFF/InputFiles.cpp
243

Here (and in a few other places) you are making assumptions about the host system's data layout. This may be a largely theoretical problem though.

COFF/SymbolTable.cpp
89

Should this diagnose that Undef is undefined because Sym->Body is undefined?

test/COFF/driver.test
4

On Linux this reads "nosuchfile.obj: No such file or directory" (capital N).

majnemer added inline comments.May 26 2015, 6:56 PM
COFF/InputFiles.cpp
243

I agree. I think the right fix is to add a coff_import_header to llvm/Object/COFF.h and reinterpret_cast the Buf.

ruiu added a comment.May 26 2015, 8:01 PM

Peter,

Thank you for the patch. I applied that to my local copy. Will upload a new
patch.

ruiu updated this revision to Diff 26571.May 26 2015, 8:01 PM

Updated as per Peter's comments.

ruiu updated this revision to Diff 26572.May 26 2015, 8:02 PM
ruiu added a comment.May 26 2015, 8:56 PM

Davide,

I need more eyes for this patch than for regular patches this is important.
I expect it'll stay here for a while. Even if you can't make it, I'd really
appreciate if you could review.

If you are not particularly interested in Windows-specific thing, you can
skip the code for the import table, and you'll still be able to get the
meaning of the other code. That piece of code is probably a good example
about how to create chunks of data that's not backed by input sections
(it's very easy, btw), but details about the DLL import table may not be
interesting for you.

This is beautiful!

Please do check it in. Given the time I would be eager to explore a similar design for ELF.

CMakeLists.txt
99

No need to change it now, but what do you think this will look like once we add ELF? Will this be renamed SectionObj or there will be an ELF directory and the common code will be moved to lib?

COFF/Chunks.h
19

Using "using" in headers? Maybe at least move them to a namespace.

40

Is there an out of line key method?

Is this deleted polimorficaly?

44

A more descriptive string would be nice :-)

65

Having a HasData protected member would be more llvm like.

79

COFF has nothing like SHF_MERGE sections, right?

This is called only for COMDAT discarding, right? There is nothing like --gc_sections at the moment.

Where should it print? STDOUT? Document that.

88

This is a bit confusing. If this represents an "output thing", how can it be a root?

Reading the rest of the code makes it clear I just just confused by Chunk being an output entity. It is something that is being "copied" to the output. Not sure where/if to document it, but it is clear enough after reading a bit.

99

Thank you so much for using llvm naming style.

112

s/representing/corresponding/?

228

Why is it called Null then?

COFF/Driver.cpp
124

So, make_dynamic_error_code is a bit of a misfeature IMHO.

There are two things at play

  • Issuing a diagnostic
  • Returning an error

For the diagnostic, something as simple as llvm::errs() is a reasonable start (or passing down a raw_ostream). At some point we really have move the diagnostic handling code out of lib/IR so that it can be used for all of llvm.

For error handling, we just need as many error_code enums as the callers can handle. For example, for all of the bitcode reading we use just InvalidBitcodeSignature, CorruptedBitcode.

169

Does link.exe really use the file name instead of the file signature?
Even if it does, it might be better to pass a MemoryBufferRef to the ArchiveFile and ObjectFile constructors and have them *not* own the buffer.

COFF/InputFiles.cpp
111

dyn_cast instead of isa+cast

150

who owns this?

COFF/InputFiles.h
41

Exposing an iterator might be better, no?

COFF/Memory.h
22

We should make this the one true StringSaver in llvm :-)

COFF/README.md
154

Nice explanation of the differences. I must say I like the COFF way :-)

COFF/Symbols.cpp
24

At least on ELF common is not that simple. The end result of linking two common symbols with the same name is a common symbol with the maximum size and the maximum alignment, so it is possible that it corresponds to none of the input common symbols. Is that the case for COFF?

ruiu updated this revision to Diff 26697.May 28 2015, 10:06 AM

Updated as per Rafael's comments.

ruiu added a comment.May 28 2015, 10:06 AM

Thank you for reviewing!

CMakeLists.txt
99

It depends on how it goes. If we end up having code that works really well for COFF but not fit to ELF, we would decide to share less code, and if that's the case keeping class definitions in the same directory would make sense. If that's not the case, we should move them out of this directory to lib.

COFF/Chunks.h
19

Done.

40

It's deleted polimorfically. Writer has a vector of unique_ptr<Chunk>.

44

Done.

65

Writer uses this member function.

79

It has COMDAT merging; search for "opt:icf". And gc-sections is already implemented in this patch.

112

Done.

228

Added comments.

COFF/Driver.cpp
124

I agree with that. If you want to use this as a library (not in a fancy way, but just by passing a list of mix of strings (command line options) and memory buffer objects (instead of filename string) to link in-memory objects, you want to get a machine-readable error code instead of this kind of human-readable error messages.

However, there's a beauty of this dynamic_error_code thing. It contains a error string. Every time you check for an error, you can construct a new error message by appending prefix to that. For example, if you get an error string "file not found" from a subroutine, you can then construct a new error message "foo.obj: file not found" and then propagate it up to the caller. The caller may append more error messages, like "reading bar.obj's directive section: foo.obj: file not found".

I don't think you can do that with error code enums.

I'll leave this as is as this is not that important, I'll revisit this later.

169

Maybe I should create Driver class and make it own MemoryBuffers. Currently as you can see below it's just a non-member function. Let me do that in a follow-up patch.

I don't think link.exe uses extensions at least this way. This is just my negligence.

COFF/InputFiles.cpp
111

Done.

150

This is a placement new.

COFF/InputFiles.h
41

Are you suggesting adding begin() and end() to InputFile? If so, because a subclass of thiis class contains another iterable object, vector<Chunk *>, adding begin and end is confusing.

COFF/Memory.h
22

Yeah. This is off topic, but do you think if it makes sense to add these member functions to BumpPtrAllocator itself?

COFF/Symbols.cpp
24

The PE/COFF spec uses a term "common symbol" without defining it, there's no explanation what it is. I was thinking to investigate the behavior of the actual linker.

As to how to handle the case you described for ELF, I guess we can change the signature of this function to return ErrorOr<SymbolBody*> and return this, Other or a new SymbolBody object. This doesn't have to be done in this patch, but this needs considering. Thank you for pointing that out.

ruiu updated this revision to Diff 26699.May 28 2015, 10:08 AM

Comment at: COFF/Chunks.h:65
@@ +64,3 @@
+ // by this chunk is filled with zeros.
+ virtual bool hasData() const { return true; }

+

rafael wrote:

Having a HasData protected member would be more llvm like.

Writer uses this member function.

Oh, I meant in addition, so the method would look like

bool hasData() const { return HasData; }

This is called only for COMDAT discarding, right? There is nothing like --gc_sections at the moment.

Where should it print? STDOUT? Document that.

It has COMDAT merging; search for "opt:icf". And gc-sections is already implemented in this patch.

No results for icf :-)

What I mean by --gc-sections is something on the style of ELF, where
even sections that are not in COMDATs can be gced. I know link.exe
doesn't have that, I was just confused by the term "gc".


rafael wrote:

So, make_dynamic_error_code is a bit of a misfeature IMHO.

There are two things at play

  • Issuing a diagnostic
  • Returning an error

For the diagnostic, something as simple as llvm::errs() is a reasonable start (or passing down a raw_ostream). At some point we really have move the diagnostic handling code out of lib/IR so that it can be used for all of llvm.

For error handling, we just need as many error_code enums as the callers can handle. For example, for all of the bitcode reading we use just InvalidBitcodeSignature, CorruptedBitcode.

I agree with that. If you want to use this as a library (not in a fancy way, but just by passing a list of mix of strings (command line options) and memory buffer objects (instead of filename string) to link in-memory objects, you want to get a machine-readable error code instead of this kind of human-readable error messages.

However, there's a beauty of this dynamic_error_code thing. It contains a error string. Every time you check for an error, you can construct a new error message by appending prefix to that. For example, if you get an error string "file not found" from a subroutine, you can then construct a new error message "foo.obj: file not found" and then propagate it up to the caller. The caller may append more error messages, like "reading bar.obj's directive section: foo.obj: file not found".

I don't think you can do that with error code enums.

With enums you can't, that is the point :-)

The idea is to only add an enum value if a caller wants to do
something different for that error case. That is why for bitcode
reading we have just enums for "there was an error" and "this is not a
bitcode file at all": Some callers want to try to open a file and
ignore it if it is not a bitcode, but actual errors have to be
reported.

The diagnostic path is separate and far richer:

if (!isa<PointerType>(PtrType))
  return Error(DH, "Load/Store operand is not a pointer type");

The DH is a DiagnosticHandler callback. It can do pretty much
anything: print to stdout, abort the program, log it over the
internet, open a dialog window, etc. It is also extensible. In the
case of the bitcode reader we pass just the error_code and the user
readable message, but it is possible to include as much context as
needed.

Last but not least, the callback is an incredibly convenient point for
putting a breakpoint when debugging. When the error occurs the entire
stack is still alive.

I'll leave this as is as this is not that important, I'll revisit this later.

I agree, please don't consider any of my comments as blockers. The
performance improvements you showed are more than convincing to get
this committed and iterate afterwards.

rafael wrote:

who owns this?

This is a placement new.

oops :-)

rafael wrote:

Exposing an iterator might be better, no?

Are you suggesting adding begin() and end() to InputFile? If so, because a subclass of thiis class contains another iterable object, vector<Chunk *>, adding begin and end is confusing.

I was thinking of creating the Chunk lazily, but I see it probably
wouldn't help much.

Comment at: COFF/Memory.h:22
@@ +21,3 @@
+
+class StringAllocator {

+public:

rafael wrote:

We should make this the one true StringSaver in llvm :-)

Yeah. This is off topic, but do you think if it makes sense to add these member functions to BumpPtrAllocator itself?

I would probably leave it as a convenience class just like you have
it. It is just that the one in llvm is silly (virtual base with only
one implementation that uses a std::vector).

As to how to handle the case you described for ELF, I guess we can change the signature of this function to return ErrorOr<SymbolBody*> and return this, Other or a new SymbolBody object. This doesn't have to be done in this patch, but this needs considering. Thank you for pointing that out.

Should work.

Thanks again for all the work on this!

The overall patch looks very good to me. Some comments, don't take them as mandatory just suggestions.
Thanks!

COFF/Chunks.cpp
47

Can you add a small comment here explaining why COMDATs are GC roots?

73

using 'auto' here? Or do you think it makes the code less readable?

132

In other places you assign E in the for() initialization, any reason you don't do that here as well? Maybe there are few other places where you do this that I missed.

COFF/InputFiles.cpp
112

No braces needed, I guess.

COFF/Memory.h
30

I think it may make sense to make this function (or set of function(s)) available everywhere, rather than making them specific to lld.
I can imagine the usefulness of having a facility to return a NULL-terminated copy of a string somewhere else in the tree.

COFF/Options.td
22

Can you sort these (alphabetically?)

COFF/Writer.cpp
31

I personally prefer to express power-of-two constant(s) as 1 << SOMETHING; but it's not a strong preference.

ruiu added a comment.May 28 2015, 11:56 AM

Davide,

Thank you for reviewing!

COFF/Chunks.cpp
47

Done.

73

I think explicit type is more readable this case, although we are using C as a variable for Chunk in almost all cases.

132

I think I wrote E before for if E's initializer is too long (I just don't want to line-warp for conditionals).

COFF/InputFiles.cpp
112

I omit {} only when it doesn't have else consistently, so keeping it is probably better.

COFF/Options.td
22

Done.

COFF/Writer.cpp
31

Interesting. I always think in the other way just like this, so page size is 4096 to me than 1<<12 and disk sector is 512 than 1<<9.