This is an archive of the discontinued LLVM Phabricator instance.

[lld] Teach LLD how to parse complete linker scripts
AbandonedPublic

Authored by rafaelauler on Oct 17 2014, 5:02 PM.

Details

Summary

This patch does *not* implement any semantic actions, but it is a first step to
teach LLD how to read complete linker scripts. The additional linker scripts
statements whose parsing is now supported are:

SEARCH_DIR directive
SECTIONS directive
Symbol definitions inside SECTIONS including PROVIDE and PROVIDE_HIDDEN
C-like expressions used in many places in linker scripts
Input to output sections mapping
The goal of this patch was guided towards completely parsing a default GNU ld
linker script and the linker script used to link the FreeBSD kernel. Thus, it
also adds a test case based on the default linker script used in GNU ld for
x86_64 ELF targets.

Note: Sorry for the size of this patch. I started with the goal of incrementally
developing this parser but ultimately failed. On the other side, the logic of the
parser is kind of repetitive, as one would expect, so this patch does not feature
lots of new things.

Diff Detail

Event Timeline

rafaelauler retitled this revision from to [lld] Teach LLD how to parse complete linker scripts.
rafaelauler updated this object.
rafaelauler edited the test plan for this revision. (Show Details)
rafaelauler added reviewers: Bigcheese, shankarke.
rafaelauler added subscribers: ruiu, rafael, Unknown Object (MLST).

I'll address a comment from Rui:

"ruiu added a comment.Via Web · Fri, Oct 17, 9:28 PM
Because this is a large patch, it'll take some time to review.
Let me send out my first comment in the meantime.

Inline Comments
include/lld/ReaderWriter/LinkerScript.h
402
Is there any reason to have these member functions to mutate object's fields? It seems to me that we can make the constructor to take all these values. If we do so, we can handle all BinOps as const, which helps readers to understand the code.

The same comments are applied to all other classes representing ast nodes."

Sure, I agree. I'll try to eliminate these setters as much as possible and upload a new patch.

shankarke requested changes to this revision.Oct 17 2014, 7:29 PM
shankarke edited edge metadata.

Could you add test cases that test for invalid linker scripts as well ?

lib/ReaderWriter/LinkerScript.cpp
991–993

can you convert the assert messages to appropriate error messages that can be displayed to the user ?

This revision now requires changes to proceed.Oct 17 2014, 7:29 PM

Hi Shankar,

Thanks for your input. I agree and will update the patch according your suggestions.

rafaelauler edited edge metadata.

I changed the AST nodes creation according to Rui's suggestion.
I converted some asserts to error messages, according to Shankar's suggestion. There are still some asserts left that are only to check code sanity and are not appropriate for error messages.
I added test cases that exercises parsing errors (incomplete linker scripts) according to Shankar's suggestion.
I also changed the expression parsing functions to eliminate duplicated code and make it simpler.

emaste added a subscriber: emaste.Oct 22 2014, 5:49 AM
ruiu added a comment.Oct 22 2014, 2:24 PM

I'll review this patch today. Please hold on for a while. Thanks.

No problem, Rui, I was waiting for your feedback too. Thanks for putting in time to see this.

ruiu added inline comments.Oct 22 2014, 4:46 PM
include/lld/ReaderWriter/LinkerScript.h
323

Remove three blank lines above.

334

Is constant always unsigned?

394

sort

469

Can you make this enum class?

480

Add an example to the comment. My understanding of this class is it is representing ".text" or "SORT(.text*)" in

SECTIONS {
  .x: {  .text }
  .y { SORT(.text*) }
}
537

example here.

569

example here

607

Or, instead of adding example here, a link to the documentation is better?

For this, https://sourceware.org/binutils/docs/ld/Overlay-Description.html#Overlay-Description

617

Remove

730

In this patch you pass only 1 to this function. Do you really need to look ahead arbitrary number of tokens? The buffer management looks a bit too complicated to me.

775

You want to mention that identifier and "(" were already read.

812

I think we don't usually use \p notation.

lib/ReaderWriter/LinkerScript.cpp
85

sort

97

I think there's an implicit conversion from std::error_code to llvm::ErrorOr. So you can just write

return std::errc::io_error
100
return res;
135

You can write

else if (c >= 'A' && c <= 'F')
  res += c - 'A' + 10;
else
  return llvm::ErrorOr<uint64_t>(std::make_error_code(std::errc::io_error));
148

This needs to be case insensitive.

157

Instead of using two boolean values, you can define variable multiplier and set 1024 for K or 1024*1024 for M.

172

Huh, I didn't know that the linker script supports this way of writing a number. Good to know.

177

startswith_lower

191

Use early return here.

if (res.getError())
   return res;
195

and multiply by the multiplier

212
return '0' <= c && c <= '9'
254

You want to write multiple cases in one line. Or if expression may be shorter.

352

I think we need a function something like

drop(StringRef &s, int n)

that (destructively) drops n characters from s and returns them.

440–441

This line should be before default and

case '"': case '\'':
468

This expression

(_buffer.startswith("0x") && _buffer.size() > 2 && canContinueNumber(_buffer[2])

seems redundant, because if a string starts with "0x", it will always satisfy

canStartNumber(_buffer[0])
530

Sort

725
return 1;
735

Did you forget to add the trailing "("?

739

Ditto

837

remove space before \n

1263

This can be

if (peek(1)._kind == Token::l_paren)
  return parseFunctionCall();
Symbol *sym = ...
...
return sym;
1378

Remove this if

1384

s/LHS/lhs/

1402

s/RHS/rhs/

1405

Move consumeToken()s after this switch statement.

1410

lhs

1633

remove

ruiu added inline comments.Oct 22 2014, 4:46 PM
include/lld/ReaderWriter/LinkerScript.h
68

Sort in asciii-betical order.

121

It should be done in a different patch, but these can* member functions have nothing to do with the Lexer class (they don't use member variables of the class etc). These should be non-member function.

158

Sort.

Hi Rui,

I appreciate the thorough review, thanks! I implemented your suggestions and will upload a new patch now.

include/lld/ReaderWriter/LinkerScript.h
68

Done

121

I agree, I will submit another patch soon.

158

Done

323

Done.

334

Added the following comment to explain this:
/ A constant value is stored as unsigned because it represents absolute
/ values. We represent negative numbers by composing the unary '-' operator
/// with a constant.

I also added the UnaryMinus class to represent negative numbers and properly updated the parser, thanks for pointing out.

394

Done

469

Done

480

Done

537

Done

569

Done

607

Done

617

Done

730

I agree. I changed this algorithm to only peek at the next token. I originally wrote this function to perform arbitrary look ahead because I was not sure if the linker script grammar would need it. Currently, I believe we won't need this in the future, so I modified this function to always peek the next token, which is a much simpler problem. If we ever come across the need to look ahead more than 1 token, I'll resubmit this complete algorithm again.

775

Sorry, I'm not sure if I understood your observation here, you see, parseFunctionCall() is responsible for consuming this entire expression, including identifier and left paren.

812

No problem, removed.

lib/ReaderWriter/LinkerScript.cpp
85

Done

97

Thanks for the tip!

100

Fixed

135

Indeed, thanks, updated it.

148

OK. I don't think that StringSwitch has something like StartsWithLower, so I just added another case.

157

Fair, changed it.

177

Fixed

191

Done

195

OK

212

Fixed

254

In fact I did with multiple cases, but clang-format undid it :-) will fix it.

352

Good idea. Done.

440–441

Done

468

Fixed.

530

Done

725

Done

735

Right, thanks

739

Fixed

837

Fixed

1263

Fixed

1378

Done

1384

Fixed

1402

Fixed

1405

Done

1410

Fixed

1633

Removed

rafaelauler edited edge metadata.

Implemented Rui's suggestions.

silvas added a subscriber: silvas.Oct 23 2014, 8:01 PM

Note: Sorry for the size of this patch. I started with the goal of incrementally
developing this parser but ultimately failed. On the other side, the logic of the
parser is kind of repetitive, as one would expect, so this patch does not feature
lots of new things.

It's totally OK (and I think normal) to go back and split the history into incremental pieces after the fact if you are unhappy with the patch size. That way, you can still get the same high-quality code review that you would have gotten by incrementally developing it, which is most of the benefit of the incremental approach.

As a sanity check, I would also recommend running this over all the linker scripts in the linux kernel and the linker scripts from a typical embedded application. If you haven't already, definitely take a read through http://lists.cs.uiuc.edu/pipermail/llvmdev/2012-December/057421.html

Also, to test your "dump" functionality, you should verify that the various projects still work if you replace the real scripts with "dump"ed scripts.

In test/LinkerScript/sections.test, it looks like you directly used the tool output as the expected output for the test. This doesn't give any confidence in the correctness by itself; if you haven't already done so, definitely try to build some code using the dumped script in place of the real script as a sanity check.

Also, please add links (or at least mention) any resources you used when developing this. What did you use as your "langref"? In the future, somebody is going to have to debug this code and they need to be able to determine if it is doing the right thing (or what the right thing is supposed to be).

lib/ReaderWriter/LinkerScript.cpp
467–481

Don't use doxygen comments inside the function. Here and elsewhere.

test/LinkerScript/expr-precedence.test
28–33

Having the token dump together with the AST dump is really gross. Could you commit a separate patch that enhances linker-script-test to have two different modes and then clean up these test cases?

test/LinkerScript/missing-operand.test
20

FileCheck has some special functionality for checking diagnostics that avoids the need to hard-code absolute line numbers. You should use it: http://llvm.org/docs/CommandGuide/FileCheck.html#filecheck-expressions

Also, it should be pretty easy to do caret diagnostics.

Hi Sean, thanks for your input, I didn't know your write-up about linker scripts, it's great!

I'll check the dumped scripts for correct behavior. I put in a comment the reference I used to implement the parser (see LinkerScript.h:692); it's the GNU manual, e.g. https://sourceware.org/binutils/docs/ld/SECTIONS.html#SECTIONS.

I'll work on your suggestions.

Addressed Sean's concerns and tested the dump'ed scripts. This required adjusting some cosmetic issues to make dump'ed scripts 100% parseable by GNU ld. Also added other missing features to parse other kernel linker scripts: unary negation, fill expressions that can be indefinitely large ( =0x909090909090909090909090) and parsing symbol assignments outside SECTIONS.

Sanity checks performed: I tested SPEC userland programs linked by GNU ld, using the linker script dump'ed by this parser, and everything went fine.
I then tested linking the FreeBSD kernel with a dump'ed linker script, installed the new kernel and booted it, everything went fine.

Successfully parsed all linker scripts from the FreeBSD kernel, except ldscript.mips.cfe and ldscript.mips.octeon, which requires the PHDRS command (not implemented yet).

Things not tested: embedded scenarios, since this parser currently does not support the MEMORY command, common in these scenarios.

shankarke requested changes to this revision.Oct 25 2014, 7:45 PM
shankarke edited edge metadata.
shankarke added inline comments.
include/lld/ReaderWriter/LinkerScript.h
42–68

All of the above need to have a prefix.Also you may need to sort all the enumerated constants like others to be consistent.

http://llvm.org/docs/CodingStandards.html#name-types-functions-variables-and-enumerators-properly

276

const StringRef => StringRef everywhere.

295

remove const StringRef, StringRef is const already

335

explicit Constant ? There are other classes that would need explicit as well.

400–412

There are enums which are not classes, and some enums which are plain enums. We would need to be consistent.

Rename the enumeration names as documented in the convention.

http://llvm.org/docs/CodingStandards.html#name-types-functions-variables-and-enumerators-properly

432–437

Can you add comment on whats the depth of the ternary condition thats supported as well as an example like how you document other commands.

434

explicit again.

464–465

look at previous comments on enums.

657–667

From your tests, do you see any that uses Overlay command ?

lib/ReaderWriter/LinkerScript.cpp
92–139

Could you move all the above functions to lld/Support ? These functions may be useful for even command line parsing for example to support --section-start as well as for other flavors.

308–312

How will this perform ? For large linker scripts drop_front is going to be repeatedly called for every token.

577–582

move this inside the class itself.

754–757

move this inside the class itself.

This revision now requires changes to proceed.Oct 25 2014, 7:45 PM

Hi Shankar,
Glad you made a second round of suggestions, thanks. I answered some of your concerns below, but will also work on your suggestions and update the patch soon.

Best regards,
Rafael Auler

include/lld/ReaderWriter/LinkerScript.h
42–68

I just kept the coding style I found in this file, whose original author is not me. For me, it's no problem to convert this to the official LLVM naming convention.

This enum is already sorted according to Rui's suggestion, by using the ASCII table order.

Correct me if I'm wrong, but I think we can drop the prefix here, according to official LLVM naming rules. If the enum is inside a class, the prefix is not necessary.

276

You're right! Thanks for spotting this!

335

I agree, will do, thanks.

400–412

I guess the criteria was to use enum classes when the enum is defined outside a class, like the one Rui pointed out.

Correct me if I'm wrong, but I think this enum already follows the LLVM standard -- when the enum is inside a class, we may drop the prefix, i.e.: And instead of O_And.

432–437

Will do.

657–667

None. Do you think it's better to remove this, for now?

lib/ReaderWriter/LinkerScript.cpp
308–312

This should be inlined, in fact, it was just a refactoring suggested by Rui that would enable us to write simpler code in the next lines. You're right though, it will be called a lot. On the other hand, I don't think that this code is bad-performing: substr() will just update the size of in the StringRef object and drop_front() will directly compute a new StringRef pointer and return it (pointer arithmetic) and update size as well. But I can profile this and discover if this is a limitation.

577–582

Should we leave these as virtual anchors in the C++ code or is this not a concern for the LLD project?

shankarke added inline comments.Oct 25 2014, 8:50 PM
include/lld/ReaderWriter/LinkerScript.h
42–68

We are following a mixed pattern here, for some we use a kw_ to start the enumeration with, there are _(underscore patterns) and non underscore patterns, I am not sure on what we want to rename but I think it would be nice if we are consistent in naming the enumerations ?

657–667

leave it for now, at the time of Layout we could say this command is not supported :)

lib/ReaderWriter/LinkerScript.cpp
577–582

Leave them as virtual anchors still.

ruiu accepted this revision.Oct 26 2014, 11:42 PM
ruiu added a reviewer: ruiu.

Seems this patch is good enough to land. We may be able to improve it even more but that can be done in a different patch.

As you mentioned, this patch was too large. It could have been split into small incremental patches, each adding syntactic element one by one. This patch was fortunately not risky, but if it was and we couldn't agree that this should land, you might have wasted your time. So please start with a small patch next time. Thanks!

include/lld/ReaderWriter/LinkerScript.h
42–68

Seems only keyword (e.g. "align" or "as_needed") starts with kw_ prefix. That looks fine and this code follows the coding style. I don't see non-keyword (e.g. = or +) needs prefix.

434

This constructor has three parameters so explicit keyword is not needed.

A couple nits, but I agree with Rui that this is good to land.

lib/ReaderWriter/LinkerScript.cpp
586–592

fwiw, my favorite pattern for doing this "intersperse with commas" operation is:

for (int i = 0, e = _args.size(); i != e; ++i) {
  if (i)
    os << ", ";
  _args[i]->dump(os);
}

It's a bit cleaner than the pattern you're currently using.

1617–1621

You use this simpler pattern elsewhere:

for (int i = 0; i != numParen; ++i)
  if (!expectAndConsume(Token::r_paren, "expected )"))
    return nullptr;

also, just above you use this more complicated if-while pattern.

Thanks Sean and Rui for your code review.

lib/ReaderWriter/LinkerScript.cpp
586–592

Nice one, will use it.

1617–1621

Makes sense, I will change this code to use the simpler pattern

Committed revision 221126.

rafaelauler abandoned this revision.Nov 2 2014, 8:21 PM