This is an archive of the discontinued LLVM Phabricator instance.

[lld] Teach LLD how to parse complete linker scripts
AbandonedPublic

Authored by rafaelauler on Oct 17 2014, 5:02 PM.

Details

Summary

This patch does *not* implement any semantic actions, but it is a first step to
teach LLD how to read complete linker scripts. The additional linker scripts
statements whose parsing is now supported are:

SEARCH_DIR directive
SECTIONS directive
Symbol definitions inside SECTIONS including PROVIDE and PROVIDE_HIDDEN
C-like expressions used in many places in linker scripts
Input to output sections mapping
The goal of this patch was guided towards completely parsing a default GNU ld
linker script and the linker script used to link the FreeBSD kernel. Thus, it
also adds a test case based on the default linker script used in GNU ld for
x86_64 ELF targets.

Note: Sorry for the size of this patch. I started with the goal of incrementally
developing this parser but ultimately failed. On the other side, the logic of the
parser is kind of repetitive, as one would expect, so this patch does not feature
lots of new things.

Diff Detail

Event Timeline

rafaelauler retitled this revision from to [lld] Teach LLD how to parse complete linker scripts.
rafaelauler updated this object.
rafaelauler edited the test plan for this revision. (Show Details)
rafaelauler added reviewers: Bigcheese, shankarke.
rafaelauler added subscribers: ruiu, rafael, Unknown Object (MLST).

I'll address a comment from Rui:

"ruiu added a comment.Via Web · Fri, Oct 17, 9:28 PM
Because this is a large patch, it'll take some time to review.
Let me send out my first comment in the meantime.

Inline Comments
include/lld/ReaderWriter/LinkerScript.h
402
Is there any reason to have these member functions to mutate object's fields? It seems to me that we can make the constructor to take all these values. If we do so, we can handle all BinOps as const, which helps readers to understand the code.

The same comments are applied to all other classes representing ast nodes."

Sure, I agree. I'll try to eliminate these setters as much as possible and upload a new patch.

shankarke requested changes to this revision.Oct 17 2014, 7:29 PM
shankarke edited edge metadata.

Could you add test cases that test for invalid linker scripts as well ?

lib/ReaderWriter/LinkerScript.cpp
977–979

can you convert the assert messages to appropriate error messages that can be displayed to the user ?

This revision now requires changes to proceed.Oct 17 2014, 7:29 PM

Hi Shankar,

Thanks for your input. I agree and will update the patch according your suggestions.

rafaelauler edited edge metadata.

I changed the AST nodes creation according to Rui's suggestion.
I converted some asserts to error messages, according to Shankar's suggestion. There are still some asserts left that are only to check code sanity and are not appropriate for error messages.
I added test cases that exercises parsing errors (incomplete linker scripts) according to Shankar's suggestion.
I also changed the expression parsing functions to eliminate duplicated code and make it simpler.

emaste added a subscriber: emaste.Oct 22 2014, 5:49 AM
ruiu added a comment.Oct 22 2014, 2:24 PM

I'll review this patch today. Please hold on for a while. Thanks.

No problem, Rui, I was waiting for your feedback too. Thanks for putting in time to see this.

ruiu added inline comments.Oct 22 2014, 4:46 PM
include/lld/ReaderWriter/LinkerScript.h
322

Remove three blank lines above.

333

Is constant always unsigned?

426

sort

501

Can you make this enum class?

512

Add an example to the comment. My understanding of this class is it is representing ".text" or "SORT(.text*)" in

SECTIONS {
  .x: {  .text }
  .y { SORT(.text*) }
}
569

example here.

601

example here

639

Or, instead of adding example here, a link to the documentation is better?

For this, https://sourceware.org/binutils/docs/ld/Overlay-Description.html#Overlay-Description

649

Remove

724

In this patch you pass only 1 to this function. Do you really need to look ahead arbitrary number of tokens? The buffer management looks a bit too complicated to me.

805

You want to mention that identifier and "(" were already read.

842

I think we don't usually use \p notation.

lib/ReaderWriter/LinkerScript.cpp
85

sort

96

I think there's an implicit conversion from std::error_code to llvm::ErrorOr. So you can just write

return std::errc::io_error
99
return res;
134

You can write

else if (c >= 'A' && c <= 'F')
  res += c - 'A' + 10;
else
  return llvm::ErrorOr<uint64_t>(std::make_error_code(std::errc::io_error));
147

This needs to be case insensitive.

156

Instead of using two boolean values, you can define variable multiplier and set 1024 for K or 1024*1024 for M.

171

Huh, I didn't know that the linker script supports this way of writing a number. Good to know.

176

startswith_lower

190

Use early return here.

if (res.getError())
   return res;
194

and multiply by the multiplier

211
return '0' <= c && c <= '9'
253

You want to write multiple cases in one line. Or if expression may be shorter.

325

I think we need a function something like

drop(StringRef &s, int n)

that (destructively) drops n characters from s and returns them.

456

This line should be before default and

case '"': case '\'':
469

This expression

(_buffer.startswith("0x") && _buffer.size() > 2 && canContinueNumber(_buffer[2])

seems redundant, because if a string starts with "0x", it will always satisfy

canStartNumber(_buffer[0])
552

Sort

748
return 1;
758

Did you forget to add the trailing "("?

762

Ditto

860

remove space before \n

1561

This can be

if (peek(1)._kind == Token::l_paren)
  return parseFunctionCall();
Symbol *sym = ...
...
return sym;
1676

Remove this if

1682

s/LHS/lhs/

1700

s/RHS/rhs/

1703

Move consumeToken()s after this switch statement.

1708

lhs

1931

remove

ruiu added inline comments.Oct 22 2014, 4:46 PM
include/lld/ReaderWriter/LinkerScript.h
70

Sort in asciii-betical order.

120

It should be done in a different patch, but these can* member functions have nothing to do with the Lexer class (they don't use member variables of the class etc). These should be non-member function.

157

Sort.

Hi Rui,

I appreciate the thorough review, thanks! I implemented your suggestions and will upload a new patch now.

include/lld/ReaderWriter/LinkerScript.h
70

Done

120

I agree, I will submit another patch soon.

157

Done

322

Done.

333

Added the following comment to explain this:
/ A constant value is stored as unsigned because it represents absolute
/ values. We represent negative numbers by composing the unary '-' operator
/// with a constant.

I also added the UnaryMinus class to represent negative numbers and properly updated the parser, thanks for pointing out.

426

Done

501

Done

512

Done

569

Done

601

Done

639

Done

649

Done

724

I agree. I changed this algorithm to only peek at the next token. I originally wrote this function to perform arbitrary look ahead because I was not sure if the linker script grammar would need it. Currently, I believe we won't need this in the future, so I modified this function to always peek the next token, which is a much simpler problem. If we ever come across the need to look ahead more than 1 token, I'll resubmit this complete algorithm again.

805

Sorry, I'm not sure if I understood your observation here, you see, parseFunctionCall() is responsible for consuming this entire expression, including identifier and left paren.

842

No problem, removed.

lib/ReaderWriter/LinkerScript.cpp
85

Done

96

Thanks for the tip!

99

Fixed

134

Indeed, thanks, updated it.

147

OK. I don't think that StringSwitch has something like StartsWithLower, so I just added another case.

156

Fair, changed it.

176

Fixed

190

Done

194

OK

211

Fixed

253

In fact I did with multiple cases, but clang-format undid it :-) will fix it.

325

Good idea. Done.

456

Done

469

Fixed.

552

Done

748

Done

758

Right, thanks

762

Fixed

860

Fixed

1561

Fixed

1676

Done

1682

Fixed

1700

Fixed

1703

Done

1708

Fixed

1931

Removed

rafaelauler edited edge metadata.

Implemented Rui's suggestions.

silvas added a subscriber: silvas.Oct 23 2014, 8:01 PM

Note: Sorry for the size of this patch. I started with the goal of incrementally
developing this parser but ultimately failed. On the other side, the logic of the
parser is kind of repetitive, as one would expect, so this patch does not feature
lots of new things.

It's totally OK (and I think normal) to go back and split the history into incremental pieces after the fact if you are unhappy with the patch size. That way, you can still get the same high-quality code review that you would have gotten by incrementally developing it, which is most of the benefit of the incremental approach.

As a sanity check, I would also recommend running this over all the linker scripts in the linux kernel and the linker scripts from a typical embedded application. If you haven't already, definitely take a read through http://lists.cs.uiuc.edu/pipermail/llvmdev/2012-December/057421.html

Also, to test your "dump" functionality, you should verify that the various projects still work if you replace the real scripts with "dump"ed scripts.

In test/LinkerScript/sections.test, it looks like you directly used the tool output as the expected output for the test. This doesn't give any confidence in the correctness by itself; if you haven't already done so, definitely try to build some code using the dumped script in place of the real script as a sanity check.

Also, please add links (or at least mention) any resources you used when developing this. What did you use as your "langref"? In the future, somebody is going to have to debug this code and they need to be able to determine if it is doing the right thing (or what the right thing is supposed to be).

lib/ReaderWriter/LinkerScript.cpp
479–489

Don't use doxygen comments inside the function. Here and elsewhere.

test/LinkerScript/expr-precedence.test
27–32 ↗(On Diff #15373)

Having the token dump together with the AST dump is really gross. Could you commit a separate patch that enhances linker-script-test to have two different modes and then clean up these test cases?

test/LinkerScript/missing-operand.test
19 ↗(On Diff #15373)

FileCheck has some special functionality for checking diagnostics that avoids the need to hard-code absolute line numbers. You should use it: http://llvm.org/docs/CommandGuide/FileCheck.html#filecheck-expressions

Also, it should be pretty easy to do caret diagnostics.

Hi Sean, thanks for your input, I didn't know your write-up about linker scripts, it's great!

I'll check the dumped scripts for correct behavior. I put in a comment the reference I used to implement the parser (see LinkerScript.h:692); it's the GNU manual, e.g. https://sourceware.org/binutils/docs/ld/SECTIONS.html#SECTIONS.

I'll work on your suggestions.

Addressed Sean's concerns and tested the dump'ed scripts. This required adjusting some cosmetic issues to make dump'ed scripts 100% parseable by GNU ld. Also added other missing features to parse other kernel linker scripts: unary negation, fill expressions that can be indefinitely large ( =0x909090909090909090909090) and parsing symbol assignments outside SECTIONS.

Sanity checks performed: I tested SPEC userland programs linked by GNU ld, using the linker script dump'ed by this parser, and everything went fine.
I then tested linking the FreeBSD kernel with a dump'ed linker script, installed the new kernel and booted it, everything went fine.

Successfully parsed all linker scripts from the FreeBSD kernel, except ldscript.mips.cfe and ldscript.mips.octeon, which requires the PHDRS command (not implemented yet).

Things not tested: embedded scenarios, since this parser currently does not support the MEMORY command, common in these scenarios.

shankarke requested changes to this revision.Oct 25 2014, 7:45 PM
shankarke edited edge metadata.
shankarke added inline comments.
include/lld/ReaderWriter/LinkerScript.h
44–70

All of the above need to have a prefix.Also you may need to sort all the enumerated constants like others to be consistent.

http://llvm.org/docs/CodingStandards.html#name-types-functions-variables-and-enumerators-properly

275

const StringRef => StringRef everywhere.

294–338

remove const StringRef, StringRef is const already

334

explicit Constant ? There are other classes that would need explicit as well.

432–444

There are enums which are not classes, and some enums which are plain enums. We would need to be consistent.

Rename the enumeration names as documented in the convention.

http://llvm.org/docs/CodingStandards.html#name-types-functions-variables-and-enumerators-properly

464–469

Can you add comment on whats the depth of the ternary condition thats supported as well as an example like how you document other commands.

466

explicit again.

496–497

look at previous comments on enums.

689–699

From your tests, do you see any that uses Overlay command ?

lib/ReaderWriter/LinkerScript.cpp
91–138

Could you move all the above functions to lld/Support ? These functions may be useful for even command line parsing for example to support --section-start as well as for other flavors.

307–311

How will this perform ? For large linker scripts drop_front is going to be repeatedly called for every token.

600–605

move this inside the class itself.

777–780

move this inside the class itself.

This revision now requires changes to proceed.Oct 25 2014, 7:45 PM

Hi Shankar,
Glad you made a second round of suggestions, thanks. I answered some of your concerns below, but will also work on your suggestions and update the patch soon.

Best regards,
Rafael Auler

include/lld/ReaderWriter/LinkerScript.h
44–70

I just kept the coding style I found in this file, whose original author is not me. For me, it's no problem to convert this to the official LLVM naming convention.

This enum is already sorted according to Rui's suggestion, by using the ASCII table order.

Correct me if I'm wrong, but I think we can drop the prefix here, according to official LLVM naming rules. If the enum is inside a class, the prefix is not necessary.

275

You're right! Thanks for spotting this!

334

I agree, will do, thanks.

432–444

I guess the criteria was to use enum classes when the enum is defined outside a class, like the one Rui pointed out.

Correct me if I'm wrong, but I think this enum already follows the LLVM standard -- when the enum is inside a class, we may drop the prefix, i.e.: And instead of O_And.

464–469

Will do.

689–699

None. Do you think it's better to remove this, for now?

lib/ReaderWriter/LinkerScript.cpp
307–311

This should be inlined, in fact, it was just a refactoring suggested by Rui that would enable us to write simpler code in the next lines. You're right though, it will be called a lot. On the other hand, I don't think that this code is bad-performing: substr() will just update the size of in the StringRef object and drop_front() will directly compute a new StringRef pointer and return it (pointer arithmetic) and update size as well. But I can profile this and discover if this is a limitation.

600–605

Should we leave these as virtual anchors in the C++ code or is this not a concern for the LLD project?

shankarke added inline comments.Oct 25 2014, 8:50 PM
include/lld/ReaderWriter/LinkerScript.h
44–70

We are following a mixed pattern here, for some we use a kw_ to start the enumeration with, there are _(underscore patterns) and non underscore patterns, I am not sure on what we want to rename but I think it would be nice if we are consistent in naming the enumerations ?

689–699

leave it for now, at the time of Layout we could say this command is not supported :)

lib/ReaderWriter/LinkerScript.cpp
600–605

Leave them as virtual anchors still.

ruiu accepted this revision.Oct 26 2014, 11:42 PM
ruiu added a reviewer: ruiu.

Seems this patch is good enough to land. We may be able to improve it even more but that can be done in a different patch.

As you mentioned, this patch was too large. It could have been split into small incremental patches, each adding syntactic element one by one. This patch was fortunately not risky, but if it was and we couldn't agree that this should land, you might have wasted your time. So please start with a small patch next time. Thanks!

include/lld/ReaderWriter/LinkerScript.h
44–70

Seems only keyword (e.g. "align" or "as_needed") starts with kw_ prefix. That looks fine and this code follows the coding style. I don't see non-keyword (e.g. = or +) needs prefix.

466

This constructor has three parameters so explicit keyword is not needed.

A couple nits, but I agree with Rui that this is good to land.

lib/ReaderWriter/LinkerScript.cpp
609–615

fwiw, my favorite pattern for doing this "intersperse with commas" operation is:

for (int i = 0, e = _args.size(); i != e; ++i) {
  if (i)
    os << ", ";
  _args[i]->dump(os);
}

It's a bit cleaner than the pattern you're currently using.

1915–1919

You use this simpler pattern elsewhere:

for (int i = 0; i != numParen; ++i)
  if (!expectAndConsume(Token::r_paren, "expected )"))
    return nullptr;

also, just above you use this more complicated if-while pattern.

Thanks Sean and Rui for your code review.

lib/ReaderWriter/LinkerScript.cpp
609–615

Nice one, will use it.

1915–1919

Makes sense, I will change this code to use the simpler pattern

Committed revision 221126.

rafaelauler abandoned this revision.Nov 2 2014, 8:21 PM