Page MenuHomePhabricator

[lld] Teach LLD how to parse complete linker scripts
AbandonedPublic

Authored by rafaelauler on Oct 17 2014, 5:02 PM.

Details

Summary

This patch does *not* implement any semantic actions, but it is a first step to
teach LLD how to read complete linker scripts. The additional linker scripts
statements whose parsing is now supported are:

SEARCH_DIR directive
SECTIONS directive
Symbol definitions inside SECTIONS including PROVIDE and PROVIDE_HIDDEN
C-like expressions used in many places in linker scripts
Input to output sections mapping
The goal of this patch was guided towards completely parsing a default GNU ld
linker script and the linker script used to link the FreeBSD kernel. Thus, it
also adds a test case based on the default linker script used in GNU ld for
x86_64 ELF targets.

Note: Sorry for the size of this patch. I started with the goal of incrementally
developing this parser but ultimately failed. On the other side, the logic of the
parser is kind of repetitive, as one would expect, so this patch does not feature
lots of new things.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
ruiu added inline comments.Oct 22 2014, 4:46 PM
include/lld/ReaderWriter/LinkerScript.h
68

Sort in asciii-betical order.

121

It should be done in a different patch, but these can* member functions have nothing to do with the Lexer class (they don't use member variables of the class etc). These should be non-member function.

158

Sort.

Hi Rui,

I appreciate the thorough review, thanks! I implemented your suggestions and will upload a new patch now.

include/lld/ReaderWriter/LinkerScript.h
68

Done

121

I agree, I will submit another patch soon.

158

Done

323

Done.

334

Added the following comment to explain this:
/ A constant value is stored as unsigned because it represents absolute
/ values. We represent negative numbers by composing the unary '-' operator
/// with a constant.

I also added the UnaryMinus class to represent negative numbers and properly updated the parser, thanks for pointing out.

394

Done

469

Done

480

Done

537

Done

569

Done

607

Done

617

Done

730

I agree. I changed this algorithm to only peek at the next token. I originally wrote this function to perform arbitrary look ahead because I was not sure if the linker script grammar would need it. Currently, I believe we won't need this in the future, so I modified this function to always peek the next token, which is a much simpler problem. If we ever come across the need to look ahead more than 1 token, I'll resubmit this complete algorithm again.

775

Sorry, I'm not sure if I understood your observation here, you see, parseFunctionCall() is responsible for consuming this entire expression, including identifier and left paren.

812

No problem, removed.

lib/ReaderWriter/LinkerScript.cpp
85

Done

97

Thanks for the tip!

100

Fixed

135

Indeed, thanks, updated it.

148

OK. I don't think that StringSwitch has something like StartsWithLower, so I just added another case.

157

Fair, changed it.

177

Fixed

191

Done

195

OK

212

Fixed

254

In fact I did with multiple cases, but clang-format undid it :-) will fix it.

352

Good idea. Done.

440–441

Done

468

Fixed.

530

Done

725

Done

735

Right, thanks

739

Fixed

837

Fixed

1263

Fixed

1378

Done

1384

Fixed

1402

Fixed

1405

Done

1410

Fixed

1633

Removed

rafaelauler edited edge metadata.

Implemented Rui's suggestions.

silvas added a subscriber: silvas.Oct 23 2014, 8:01 PM

Note: Sorry for the size of this patch. I started with the goal of incrementally
developing this parser but ultimately failed. On the other side, the logic of the
parser is kind of repetitive, as one would expect, so this patch does not feature
lots of new things.

It's totally OK (and I think normal) to go back and split the history into incremental pieces after the fact if you are unhappy with the patch size. That way, you can still get the same high-quality code review that you would have gotten by incrementally developing it, which is most of the benefit of the incremental approach.

As a sanity check, I would also recommend running this over all the linker scripts in the linux kernel and the linker scripts from a typical embedded application. If you haven't already, definitely take a read through http://lists.cs.uiuc.edu/pipermail/llvmdev/2012-December/057421.html

Also, to test your "dump" functionality, you should verify that the various projects still work if you replace the real scripts with "dump"ed scripts.

In test/LinkerScript/sections.test, it looks like you directly used the tool output as the expected output for the test. This doesn't give any confidence in the correctness by itself; if you haven't already done so, definitely try to build some code using the dumped script in place of the real script as a sanity check.

Also, please add links (or at least mention) any resources you used when developing this. What did you use as your "langref"? In the future, somebody is going to have to debug this code and they need to be able to determine if it is doing the right thing (or what the right thing is supposed to be).

lib/ReaderWriter/LinkerScript.cpp
467–481

Don't use doxygen comments inside the function. Here and elsewhere.

test/LinkerScript/expr-precedence.test
28–33

Having the token dump together with the AST dump is really gross. Could you commit a separate patch that enhances linker-script-test to have two different modes and then clean up these test cases?

test/LinkerScript/missing-operand.test
20

FileCheck has some special functionality for checking diagnostics that avoids the need to hard-code absolute line numbers. You should use it: http://llvm.org/docs/CommandGuide/FileCheck.html#filecheck-expressions

Also, it should be pretty easy to do caret diagnostics.

Hi Sean, thanks for your input, I didn't know your write-up about linker scripts, it's great!

I'll check the dumped scripts for correct behavior. I put in a comment the reference I used to implement the parser (see LinkerScript.h:692); it's the GNU manual, e.g. https://sourceware.org/binutils/docs/ld/SECTIONS.html#SECTIONS.

I'll work on your suggestions.

Addressed Sean's concerns and tested the dump'ed scripts. This required adjusting some cosmetic issues to make dump'ed scripts 100% parseable by GNU ld. Also added other missing features to parse other kernel linker scripts: unary negation, fill expressions that can be indefinitely large ( =0x909090909090909090909090) and parsing symbol assignments outside SECTIONS.

Sanity checks performed: I tested SPEC userland programs linked by GNU ld, using the linker script dump'ed by this parser, and everything went fine.
I then tested linking the FreeBSD kernel with a dump'ed linker script, installed the new kernel and booted it, everything went fine.

Successfully parsed all linker scripts from the FreeBSD kernel, except ldscript.mips.cfe and ldscript.mips.octeon, which requires the PHDRS command (not implemented yet).

Things not tested: embedded scenarios, since this parser currently does not support the MEMORY command, common in these scenarios.

shankarke requested changes to this revision.Oct 25 2014, 7:45 PM
shankarke edited edge metadata.
shankarke added inline comments.
include/lld/ReaderWriter/LinkerScript.h
42–68

All of the above need to have a prefix.Also you may need to sort all the enumerated constants like others to be consistent.

http://llvm.org/docs/CodingStandards.html#name-types-functions-variables-and-enumerators-properly

276

const StringRef => StringRef everywhere.

295

remove const StringRef, StringRef is const already

335

explicit Constant ? There are other classes that would need explicit as well.

400–412

There are enums which are not classes, and some enums which are plain enums. We would need to be consistent.

Rename the enumeration names as documented in the convention.

http://llvm.org/docs/CodingStandards.html#name-types-functions-variables-and-enumerators-properly

432–437

Can you add comment on whats the depth of the ternary condition thats supported as well as an example like how you document other commands.

434

explicit again.

464–465

look at previous comments on enums.

657–667

From your tests, do you see any that uses Overlay command ?

lib/ReaderWriter/LinkerScript.cpp
92–139

Could you move all the above functions to lld/Support ? These functions may be useful for even command line parsing for example to support --section-start as well as for other flavors.

308–312

How will this perform ? For large linker scripts drop_front is going to be repeatedly called for every token.

577–582

move this inside the class itself.

754–757

move this inside the class itself.

This revision now requires changes to proceed.Oct 25 2014, 7:45 PM

Hi Shankar,
Glad you made a second round of suggestions, thanks. I answered some of your concerns below, but will also work on your suggestions and update the patch soon.

Best regards,
Rafael Auler

include/lld/ReaderWriter/LinkerScript.h
42–68

I just kept the coding style I found in this file, whose original author is not me. For me, it's no problem to convert this to the official LLVM naming convention.

This enum is already sorted according to Rui's suggestion, by using the ASCII table order.

Correct me if I'm wrong, but I think we can drop the prefix here, according to official LLVM naming rules. If the enum is inside a class, the prefix is not necessary.

276

You're right! Thanks for spotting this!

335

I agree, will do, thanks.

400–412

I guess the criteria was to use enum classes when the enum is defined outside a class, like the one Rui pointed out.

Correct me if I'm wrong, but I think this enum already follows the LLVM standard -- when the enum is inside a class, we may drop the prefix, i.e.: And instead of O_And.

432–437

Will do.

657–667

None. Do you think it's better to remove this, for now?

lib/ReaderWriter/LinkerScript.cpp
308–312

This should be inlined, in fact, it was just a refactoring suggested by Rui that would enable us to write simpler code in the next lines. You're right though, it will be called a lot. On the other hand, I don't think that this code is bad-performing: substr() will just update the size of in the StringRef object and drop_front() will directly compute a new StringRef pointer and return it (pointer arithmetic) and update size as well. But I can profile this and discover if this is a limitation.

577–582

Should we leave these as virtual anchors in the C++ code or is this not a concern for the LLD project?

shankarke added inline comments.Oct 25 2014, 8:50 PM
include/lld/ReaderWriter/LinkerScript.h
42–68

We are following a mixed pattern here, for some we use a kw_ to start the enumeration with, there are _(underscore patterns) and non underscore patterns, I am not sure on what we want to rename but I think it would be nice if we are consistent in naming the enumerations ?

657–667

leave it for now, at the time of Layout we could say this command is not supported :)

lib/ReaderWriter/LinkerScript.cpp
577–582

Leave them as virtual anchors still.

ruiu accepted this revision.Oct 26 2014, 11:42 PM
ruiu added a reviewer: ruiu.

Seems this patch is good enough to land. We may be able to improve it even more but that can be done in a different patch.

As you mentioned, this patch was too large. It could have been split into small incremental patches, each adding syntactic element one by one. This patch was fortunately not risky, but if it was and we couldn't agree that this should land, you might have wasted your time. So please start with a small patch next time. Thanks!

include/lld/ReaderWriter/LinkerScript.h
42–68

Seems only keyword (e.g. "align" or "as_needed") starts with kw_ prefix. That looks fine and this code follows the coding style. I don't see non-keyword (e.g. = or +) needs prefix.

434

This constructor has three parameters so explicit keyword is not needed.

A couple nits, but I agree with Rui that this is good to land.

lib/ReaderWriter/LinkerScript.cpp
586–592

fwiw, my favorite pattern for doing this "intersperse with commas" operation is:

for (int i = 0, e = _args.size(); i != e; ++i) {
  if (i)
    os << ", ";
  _args[i]->dump(os);
}

It's a bit cleaner than the pattern you're currently using.

1617–1621

You use this simpler pattern elsewhere:

for (int i = 0; i != numParen; ++i)
  if (!expectAndConsume(Token::r_paren, "expected )"))
    return nullptr;

also, just above you use this more complicated if-while pattern.

Thanks Sean and Rui for your code review.

lib/ReaderWriter/LinkerScript.cpp
586–592

Nice one, will use it.

1617–1621

Makes sense, I will change this code to use the simpler pattern

Committed revision 221126.

rafaelauler abandoned this revision.Nov 2 2014, 8:21 PM