This is an archive of the discontinued LLVM Phabricator instance.

[ARM] Fix dropped dollar sign from symbols in branch targets
ClosedPublic

Authored by pratlucas on Jan 22 2020, 5:06 AM.

Details

Summary

ARMAsmParser was incorrectly dropping a leading dollar sign character
from symbol names in targets of branch instructions. This was caused by
an incorrect assumption that the contents following the dollar sign
token should be handled as a constant immediate, similarly to the #
token.

This patch avoids the operand parsing from consuming the dollar sign
token when it is followed by an identifier, making sure it is properly
parsed as part of the expression.

Diff Detail

Event Timeline

pratlucas created this revision.Jan 22 2020, 5:06 AM

If the goal here is to match the GNU assembler, I don't think you've achieved that; gas treats "b $4" as a branch to the symbol "$4", not an absolute offset.

carwil added a subscriber: carwil.Jan 23 2020, 5:16 AM
chill added a subscriber: chill.Jan 23 2020, 7:03 AM
chill added inline comments.
llvm/lib/Target/ARM/AsmParser/ARMAsmParser.cpp
6124

What if we have here $ foo, i.e. whitespace after the dollar? We should not paste together two separate tokens $ and <whatever> to form an identifier, $foo, $12 are identifiers, but $ 12 is not.

pratlucas marked an inline comment as done.Jan 23 2020, 7:43 AM
pratlucas added inline comments.
llvm/lib/Target/ARM/AsmParser/ARMAsmParser.cpp
6124

From what I've checked, parseExpression(...) takes care of this scenario:

<stdin>:36:11: error: invalid token in expression
        b $ foo
          ^

The change is not binding the two actual tokens together, but only refraining from removing the $ token from the expression.
Please let me know if you believe a more active handling of this scenario is necessary.

pratlucas marked an inline comment as done.Jan 23 2020, 7:54 AM
pratlucas added inline comments.
llvm/lib/Target/ARM/AsmParser/ARMAsmParser.cpp
6124

Checking the behavior without the changes from the patch, expressions like b $ foo are currently accepted by the parser with no errors, as the $ token is dropped from the expression.

If the goal here is to match the GNU assembler, I don't think you've achieved that; gas treats "b $4" as a branch to the symbol "$4", not an absolute offset.

Good point @efriedma , I took a look at gas and it indeed handles $4 as a symbol on branch instructions.

I've checked the patch that originated the existing behavior (ef70e9b704f7f482faac0696c7288743061ed652) and it appears to have introduced this intentionally, though. The following part of the commit message got me concerned that we might break the handling of legacy code if matching gas' behavior:

Backwards compatibility with 'gas'. #imm is the preferered and documented syntax, but lots of existing code uses the '$' prefix, so we should support it if we can.

Do you have any thoughts on that? I couldn't find any example of the mentioned "existing code" on the regression tests.

chill added a comment.Jan 24 2020, 6:24 AM

There's obviously lexer ambiguity in interpreting, e.g. the character sequence $4 as a single <identifier> token, or as the two tokens $ and a <constant>.
Looks like the tokenisation needs to be context dependent: in certain directives (.global $4) and instructions (b $4) the sequence needs to be interpreted a single identifier, and
in other contexts (ldr r0, [r0, $4]) as two separate tokens.

chill added inline comments.Jan 24 2020, 6:30 AM
llvm/lib/Target/ARM/AsmParser/ARMAsmParser.cpp
6124

Well, if we have two separate tokens (AsmToken::Dollar and AsmToken::Idenrifier), but somehow end up with symbol name $foo, something must be combining those tokens into one.

That happens in AsmParser::parseIdentifier (https://github.com/llvm/llvm-project/blob/master/llvm/lib/MC/MCParser/AsmParser.cpp#L2844), but (unfortunately?) this function does not handle
a non-identifier immediately following the $.

chill added inline comments.Jan 24 2020, 6:52 AM
llvm/lib/Target/ARM/AsmParser/ARMAsmParser.cpp
6124

(Indeed it check tokens are adjacent).

pratlucas updated this revision to Diff 241077.Jan 29 2020, 2:34 AM

[ARM] Updating handling of $-prefixed operands for branch instructions

Updating handling of the dollar sign token to check for the Mnemonic being
evaluated in order to decide wether to try to parse it as a $-prefixed
symbol or to consider it the start of an immediate expression. The token
should be considered part of a symbol when handling branch instructions.

pratlucas marked 3 inline comments as done.Jan 29 2020, 2:36 AM
chill added a comment.Feb 6 2020, 8:26 AM

Could you, please, add tests for the following cases (some of them will fail, and need to be fixed):

  1. b $4

$4 needs to be interpreted as a symbol name; It's GCC behaviour as well.

  1. b $foo + 4

It's an error now; needs to be like b foo + 4, but with symbol $foo.

  1. b $ 4, b $ foo + 4 and b $ foo

These work as expected.

Rationale: GCC does not support $ as introducing an expression, but does
allow an identifier to start with a $. IMHO, our approach here should be:
when there's ambiguity, resolve it in favour of the GCC's interpretation, e.g. b $4 is parsed
as a branch to symbol.

Regarding the b $4 case, I found that the limitation currently lies in the expression/identifier parsing at the core AsmParser, which only detects $-prefixed Identifier or String tokens at the moment.
As this part of the code has no knowledge of the branch instruction context, I'm afraid updating it to accept Integer tokens as well might cause some inconsistencies in the instructions for which we drop the leading Dollar token. For example:

# Branch instructions consistently treat $<integere> as identifier
b   $4                  @ b ($4)         - Symbol
b   $4 + $4             @ b #($4)+($4)   - Symbol
# Other instructions treat those as identifiers only when they're
# on the begining of the operand
movw  r0, $4            @ movw r0, #4            - Immediate constant
movw  r0, :lower16:$4   @ movw r0, :lower16:($4) - Symbol

From what I checked on the gas behavior, it seems to only treat those as constant immediates when actual constants are explicitly expected by the instruction. Otherwise it assumes those are identifiers.
Would you say the behavior above is the expected one? Or should we go for the alignment with gas' behavior?

chill added a comment.Feb 17 2020, 2:40 AM

IMHO, we should resolve ambiguity between $ starting an expression, and $ being part of an identifier in favour of the latter, reason
being that using $ instead of # is a "non-standard" extension, whereas dollars in identifiers must be fully supported.

I agree with resolving the ambiguity in favor of identifiers regardless of the mnemonic for the general-case operand parsing, avoiding inconsistencies. Specific cases where constant immediates are explicitly expected are currently handled by their own ParseMethod implementation.
If there's no objection, I'll update the patch to reflect this behavior.

pratlucas updated this revision to Diff 246418.Feb 25 2020, 5:08 AM

Removed restriction of branch instruction mnemonic for allowing $-prefixed
symbol names in operands, resolving ambriguity in favor of symbol names over
expression start.

There is currently a limitation on AsmParser's core causing $-prefixed symbol
names composed by numeric sequences not to be accepted. This will be handled in
a separate patch.
The test case for b $4 will be added to the separate patch accordingly.

efriedma added inline comments.Mar 4 2020, 1:34 PM
llvm/lib/Target/ARM/AsmParser/ARMAsmParser.cpp
6176

AsmToken::String? I don't see any testcase for that.

pratlucas updated this revision to Diff 248461.Mar 5 2020, 6:15 AM

Addressing review comment.

pratlucas marked 2 inline comments as done.Mar 5 2020, 6:18 AM
pratlucas added inline comments.
llvm/lib/Target/ARM/AsmParser/ARMAsmParser.cpp
6176

I removed AsmToken::String from the identifier handling in D75111 but forgot to update this condition. Sorry about that.

pratlucas updated this revision to Diff 248463.Mar 5 2020, 6:24 AM
pratlucas marked an inline comment as done.

Addressing review.

pratlucas updated this revision to Diff 248508.Mar 5 2020, 9:11 AM

Formatting patch according to clang-format and clang-tidy.

This revision is now accepted and ready to land.Mar 5 2020, 10:11 AM
This revision was automatically updated to reflect the committed changes.