This is an archive of the discontinued LLVM Phabricator instance.

[AsmParser][ms][X86] Fix possible misbehaviour in parsing of special tokens at start of string.
ClosedPublic

Authored by anirudhp on Apr 5 2021, 9:58 AM.

Details

Summary
  • Previously, https://reviews.llvm.org/D72680 introduced a new attribute called AllowSymbolAtNameStart (in relation to the MAsmParser changes) in MCAsmInfo.h which (according to the comment in the header) allows the following behaviour:
/// This is true if the assembler allows $ @ ? characters at the start of
/// symbol names. Defaults to false.
  • However, the usage of this field in AsmLexer.cpp doesn't seem completely accurate* for a couple of reasons.
default:
  if (MAI.doesAllowSymbolAtNameStart()) {
    // Handle Microsoft-style identifier: [a-zA-Z_$.@?][a-zA-Z0-9_$.@#?]*
    if (!isDigit(CurChar) &&
        isIdentifierChar(CurChar, MAI.doesAllowAtInName(),
                         AllowHashInIdentifier))
      return LexIdentifier();
  }
  1. The Dollar and At tokens, when occurring at the start of the string, are treated as separate tokens (AsmToken::Dollar and AsmToken::At respectively) and not lexed as an Identifier.
  2. I'm not too sure why MAI.doesAllowAtInName() is used when AllowAtInIdentifier could be used. For X86 platforms, afaict, this shouldn't be an issue, since the CommentString attribute isn't "@". (alternatively the call to the setter can be set anywhere else as needed). The AllowAtInName does have an additional important meaning, but in the context of AsmLexer, shouldn't mean anything different compared to AllowAtInIdentifier

My proposal is the following:

  • Introduce 3 new fields called AllowQuestionTokenAtStartOfString, AllowDollarTokenAtStartOfString and AllowAtTokenAtStartOfString in MCAsmInfo.h which will encapsulate the previously documented behaviour of "allowing $, @, ? characters at the start of symbol names")
  • Introduce these fields where "$", "@" are lexed, and treat them as identifiers depending on whether Allow[Dollar|At]TokenAtStartOfString is set.
  • For the sole case of "?", append it to the existing logic for treating a "default" token as an Identifier.

z/OS (HLASM) will also make use of some of these fields in follow up patches.

completely accurate* - This was based on the comments and the intended behaviour the code. I might have completely misinterpreted it, and if that is the case my sincere apologies. We can close this patch if necessary, if there are no changes to be made :)

Depends on https://reviews.llvm.org/D99374

Diff Detail

Event Timeline

anirudhp created this revision.Apr 5 2021, 9:58 AM
anirudhp requested review of this revision.Apr 5 2021, 9:58 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 5 2021, 9:58 AM
anirudhp retitled this revision from [AsmParser][ms][X86] Fix possible error in parsing of special tokens at start of string. to [AsmParser][ms][X86] Fix possible misbehaviour in parsing of special tokens at start of string..Apr 5 2021, 10:04 AM

I'm new to reviewing code here, so call me out if I mess up the etiquette. :)

Is it worth adding a non-backend-specific test in this patch too? It looks like this adds support for X86 identifiers starting with $ (I'm actually surprised this wasn't allowed before).

I have some reservations about having so many separate functions for specifying the identifier character set but to be honest I think it'll shake out if any further changes are needed in that area. (I guess you could probably add a flagset but that's probably premature at this stage.)

(Also, I landed my patch, so you're going to get a minor conflict when you rebase this. I apologise!)

llvm/include/llvm/MC/MCAsmInfo.h
617–626

It might be better to name these ...AtStartOfIdentifier (the previous nomenclature was ...AtNameStart, that works too).

Also nitpick: I would consider dropping Token from the names or using Symbol since I'd argue that these symbols never become tokens in this case. That is super pedantic though. :)

llvm/lib/MC/MCParser/AsmLexer.cpp
146

It's a bit awkward since this is a helper used elsewhere but if the comment represents all possible identifiers, then I think it needs $ & @ in the first character.

anirudhp added a comment.EditedApr 5 2021, 4:42 PM

It looks like this adds support for X86 identifiers starting with $ (I'm actually surprised this wasn't allowed before).

Well. Yes and no. If you take a look at the parseIdentifier routine in AsmParser.cpp, there is an edge case which covers when tokens start with @ and $, but there are conditions associated with that edge case. Namely, if you have a string which starts with [@|$] and the next token is an Identifier or an Integer, then you parse as an Identifier. So if you have something like $abc, and you make a call to the parseIdentifier routine, then "$abc" will be parsed as a "whole" Identifier, even though, the lexer lexes it as {AsmToken::At, AsmToken::Identifier}. But if you have something like "$$abc", then a call to the parseIdentifier routine will "fail" (and in this case the lexer lexes it as {AsmToken::At, AsmToken::At, AsmToken::Identifier}.

An argument could be made that the parseIdentifier routine could be modified or overridden wherever appropriate and given custom logic.

Part of this patch is also exploratory. The way I interpreted the comment allow $ @ ? characters at the start of symbol names. Defaults to false., told me that if "$", "@", "?" occurred at the start of the string, it should be lexed as part of an Identifier. However, now I don't know if that comment was meant to be written taking into account the parseIdentifier routine. In that case, the X86MCAsmInfo backend, should only set AllowQuestionTokenAtStartOfString to true. The other two are not "required" since parseIdentifier would "take care" of it. I'm also not too familiar with the ms semantics :)

anirudhp updated this revision to Diff 335600.Apr 6 2021, 11:14 AM
  • Addressed Comments (Rename of MCAsmInfo attribute)
anirudhp marked 2 inline comments as done.Apr 6 2021, 11:15 AM

It looks like this adds support for X86 identifiers starting with $ (I'm actually surprised this wasn't allowed before).

Well. Yes and no. If you take a look at the parseIdentifier routine in AsmParser.cpp...

Right, that makes sense!

An argument could be made that the parseIdentifier routine could be modified or overridden wherever appropriate and given custom logic.

The benefit being that target AsmParsers have the opportunity to parse the starting symbol as something else. The drawback being that stitching everything back together is more complicated (looking at the code, I'm not convinced that it'd parse $$ as an identifier, for example).

It doesn't look like this patch is going to affect other targets and it's simple.

I don't really have any other reservations. Probably worth getting a review from someone who knows the X86 backend better than me but otherwise LGTM.

anirudhp updated this revision to Diff 335656.Apr 6 2021, 2:02 PM
  • Applying same patch again.
anirudhp updated this revision to Diff 337190.Apr 13 2021, 9:44 AM
anirudhp set the repository for this revision to rG LLVM Github Monorepo.
  • Rebasing on latest master to account for the motorola integer changes that was merged recently
llvm/lib/MC/MCParser/AsmLexer.cpp
742–746

Previously, the Microsoft-style identifier allowed # (not as a starting character), but otherwise the identifier ([a-zA-Z_.][a-zA-Z0-9_$.@]*) did not. Should we be checking AllowHashInIdentifier?

llvm/unittests/MC/SystemZ/SystemZAsmLexerTest.cpp
544–545

Not sure what you're checking here that the CheckStrictCommentString* tests don't already check.

557

I think there should be a test for $ at the start of an identifier for completeness. I don't see one in SystemZAsmLexerTest.cpp but you may know of one somewhere else.

anirudhp added inline comments.Apr 13 2021, 12:18 PM
llvm/lib/MC/MCParser/AsmLexer.cpp
742–746

Previously, the Microsoft-style identifier allowed # (not as a starting character)

I wouldn't say that MS previously allowed "#" as a starting character, nor as part of the identifier. I don't believe the call to setAllowHashInIdentifier is set (apart from the eventual call to the setter for z/OS). LexIdentifier() eventually calls IsIdentifierChar with AllowAtInIdentifier and AllowHashInIdentifier.

(As an aside, "#" is special in most cases. In most cases, strings that start with "#" are already lexed as a possible comment)

llvm/unittests/MC/SystemZ/SystemZAsmLexerTest.cpp
544–545

The idea was to show that even under default circumstances, "#" is usually lexed as a default comment. But like you say its probably already covered in the CommentString tests. I'll look at it a bit deeply again, and update if needed.

557

Hmm you're right. I want to say there was a reason I didn't include the test for the "$" but honestly, now I don't remember. I'll go back and add the tests and if there was a problem / I remember why I didn't add it initially, I'll update this thread.

anirudhp updated this revision to Diff 337491.Apr 14 2021, 9:42 AM
  • Addressed comments (removed superfluous test, added a possible missing test)
anirudhp marked 3 inline comments as done.Apr 14 2021, 9:42 AM
anirudhp added inline comments.
llvm/unittests/MC/SystemZ/SystemZAsmLexerTest.cpp
544–545

Removed the test.

557

Updated the test.

This revision is now accepted and ready to land.Apr 16 2021, 11:47 AM

@epastor This changes some of the parsing logic that impacts masm. Would you please be able to provide some feedback! Thanks! :)

epastor added inline comments.Apr 19 2021, 11:30 AM
llvm/lib/MC/MCParser/AsmLexer.cpp
745

If we're handling @ in its own case below, shouldn't we also handle ? there as well?

Otherwise, looks good to me... and it's possible we can improve MASM-style lexing using this too.

anirudhp added inline comments.Apr 19 2021, 11:45 AM
llvm/lib/MC/MCParser/AsmLexer.cpp
745

shouldn't we also handle "?" there as well?

I don't believe so because "?" was never handled separately to begin with. It is already lexed "as part" of an Identifier, we just add on the additional logic to accept "?" at the start of the string as an Identifier as well.

anirudhp updated this revision to Diff 339003.Apr 20 2021, 2:42 PM
  • Rebasing on latest master.