This is an archive of the discontinued LLVM Phabricator instance.

[LLD][ELF] - Disambiguate "=fillexp" with a primary expression to allow =0x90 /DISCARD/
ClosedPublic

Authored by grimar on Feb 16 2020, 3:19 AM.

Details

Summary

Fixes https://bugs.llvm.org/show_bug.cgi?id=44903

It is about the following case:

SECTIONS {
  .foo : { *(.foo) } =0x90909090
  /DISCARD/ : { *(.bar) }
}

Here while parsing the fill expression we treated the
"/" of "/DISCARD/" as operator.

With this change, suggested by Fangrui Song, we do
not allow expressions with operators (e.g. "0x1100 + 0x22")
that are not wrapped into round brackets. It should not
be an issue for users, but helps to resolve parsing ambiguity.

Diff Detail

Event Timeline

grimar created this revision.Feb 16 2020, 3:19 AM
Herald added a project: Restricted Project. · View Herald Transcript

Documentation: https://sourceware.org/binutils/docs-2.33.1/ld/Output-Section-Fill.html

https://bugs.llvm.org/show_bug.cgi?id=44903 is related to D64130 . The problem is parsing ambiguity.

llvm-mc -filetype=obj linkerscript/sections-padding.s -o a.o

GNU ld's behavior is strange:

% ld.bfd -T =(printf 'SECTIONS { .mysec : { *(.mysec*) } =0x1100 }') a.o -o a
% readelf -Wx .mysec a

Hex dump of section '.mysec':
  0x00000000 66110011 00110011 00110011 00110011 f...............
  0x00000010 6690                                f.

% ld.bfd -T =(printf 'SECTIONS { .mysec : { *(.mysec*) } =0x1100+2+3 }') a.o -o a
% readelf -Wx .mysec a

Hex dump of section '.mysec':
  0x00000000 66000011 05000011 05000011 05000011 f...............
  0x00000010 6690                                f.

Do we have a more elegant fix:) ?

lld/test/ELF/linkerscript/fill-with-discard.test
3 ↗(On Diff #244863)

-o /dev/null

Consider merging this test into sections-padding.s

Do we have a more elegant fix:) ?

Do you have something particular in mind? The only thing I can think of is to treat '\n' as an expression terminator.
I am not sure it is a right thing to do. I think that tokenizeExpr used in this diff is a correct place for such fix.
It has elegant name FWIW..

MaskRay added a comment.EditedFeb 18 2020, 9:08 AM

Do we have a more elegant fix:) ?

Do you have something particular in mind? The only thing I can think of is to treat '\n' as an expression terminator.
I am not sure it is a right thing to do. I think that tokenizeExpr used in this diff is a correct place for such fix.
It has elegant name FWIW..

Treating \n as an expression terminator should work. For the following example, we will accept a but not b (GNU ld accepts both):

a = (3
  +4);
b = 3
  +4;

I prefer an alternative: limit = to accept readPrimary instead of readFill:

if (peek() == "=" || peek().startswith("=")) {
  inExpr = true;
  consume("=");
  cmd->filler = readFill();
  inExpr = false;
}

As you can see, GNU ld's = syntax is weird...

grimar added a comment.EditedFeb 26 2020, 3:01 AM

I prefer an alternative: limit = to accept readPrimary instead of readFill:

if (peek() == "=" || peek().startswith("=")) {
  inExpr = true;
  consume("=");
  cmd->filler = readFill();
  inExpr = false;
}

Honestly I just do not feel comfortable to do this change.
If I currectly understood, you're suggesting something like:

if (peek() == "=" || peek().startswith("=")) {
  inExpr = true;
  consume("=");
  uint64_t value = readPrimary()().val;
  std::array<uint8_t, 4> buf;
  write32be(buf.data(), (uint32_t)value);

  cmd->filler = buf;
  inExpr = false;
}

With it we stop supporting syntax like SECTIONS { .mysec : { *(.mysec*) } =0x11223300+0x44 }.
It is a deviation from the specification which says that "fillexp is an expression" and mentions a unary +.

As you can see, GNU ld's = syntax is weird...

Yes, but it the current state LLD seems to have a reasonable behavior.
I am not sure we should intentionally ignore specification and remove this already supported feature just because of a specific issue
with "/DISCARD/".

Would it make sense to make /DISCARD/ its own token recognised by the Lexer? I believe that is what BFD does. It would prevent it from being recognised as "/" "DISCARD" "/". Apologies if this is not appropriate, I'm not too familiar with this part of the code.

grimar added a comment.EditedFeb 26 2020, 3:51 AM

Would it make sense to make /DISCARD/ its own token recognised by the Lexer? I believe that is what BFD does. It would prevent it from being recognised as "/" "DISCARD" "/". Apologies if this is not appropriate, I'm not too familiar with this part of the code.

It is what my patch does :) "/DISCARD/" is a single token initially, but during parsing of the FILL expression,
we call tokenizeExpr to return "3", "*" and "5" tokens for "3*5", for example.

I.e. tokenizeExpr should probably ignore and don't split the "/DISCARD/", and that is what I do.

I prefer an alternative: limit = to accept readPrimary instead of readFill:

if (peek() == "=" || peek().startswith("=")) {
  inExpr = true;
  consume("=");
  cmd->filler = readFill();
  inExpr = false;
}

Honestly I just do not feel comfortable to do this change.
If I currectly understood, you're suggesting something like:

if (peek() == "=" || peek().startswith("=")) {
  inExpr = true;
  consume("=");
  uint64_t value = readPrimary()().val;
  std::array<uint8_t, 4> buf;
  write32be(buf.data(), (uint32_t)value);

  cmd->filler = buf;
  inExpr = false;
}

With it we stop supporting syntax like SECTIONS { .mysec : { *(.mysec*) } =0x11223300+0x44 }.
It is a deviation from the specification which says that "fillexp is an expression" and mentions a unary +.

As you can see, GNU ld's = syntax is weird...

Yes, but it the current state LLD seems to have a reasonable behavior.
I am not sure we should intentionally ignore specification and remove this already supported feature just because of a specific issue
with "/DISCARD/".

See my example in https://reviews.llvm.org/D74687#1878409 I suspect =0x10+2+3 never works properly in GNU ld.
If we disallow that in lld, a user can easily work around the limitation by using =(0x10+2+3).

Would it make sense to make /DISCARD/ its own token recognised by the Lexer? I believe that is what BFD does. It would prevent it from being recognised as "/" "DISCARD" "/". Apologies if this is not appropriate, I'm not too familiar with this part of the code.

It is what my patch does :) "/DISCARD/" is a single token initially, but during parsing of the FILL expression,
we call tokenizeExpr to return "3", "*" and "5" tokens for "3*5", for example.

I.e. tokenizeExpr should probably ignore and don't split the "/DISCARD/", and that is what I do.

The trailing tokens of one output section description can cause other ambiguity when parsing the next output section description:

SECTIONS {
  .foo : { } =3
  /a : { }
}

I think it is just not necessary to support a syntax that will inherently cause problems, especially when there is no functionality loss.

grimar planned changes to this revision.Mar 18 2020, 7:18 AM

I've revisited this patch, investigated possible use cases and reviewed our bugs history related.
I think Fangrui is right and I am going to update this diff very soon using his suggestion.

grimar updated this revision to Diff 251081.Mar 18 2020, 7:40 AM
grimar edited the summary of this revision. (Show Details)
  • Reimplemented.
MaskRay accepted this revision.Mar 18 2020, 9:02 AM
This revision is now accepted and ready to land.Mar 18 2020, 9:02 AM
MaskRay added a comment.EditedMar 18 2020, 9:06 AM

The title can probably be improved.

For do not fail parsing when "/DISCARD/" follows the fill expression., I've got one suggestion.

  • Disambiguate =fillexp with a primary expression to allow =0x90 /DISCARD/

A native speaker can suggest a better one:)

grimar retitled this revision from [LLD][ELF] - Linker script: do not fail parsing when "/DISCARD/" follows the fill expression. to [LLD][ELF] - Disambiguate "=fillexp" with a primary expression to allow =0x90 /DISCARD/.Mar 19 2020, 2:38 AM
This revision was automatically updated to reflect the committed changes.