This is an archive of the discontinued LLVM Phabricator instance.

Disambiguate a constant with both 0B prefix and H suffix.
ClosedPublic

Authored by ygao on Jul 7 2016, 2:29 PM.

Diff Detail

Repository
rL LLVM

Event Timeline

ygao updated this revision to Diff 63136.Jul 7 2016, 2:29 PM
ygao retitled this revision from to Disambiguate a constant with both 0B prefix and H suffix..
ygao updated this object.
ygao added reviewers: sidneym, colinl.
ygao added a subscriber: llvm-commits.
colinl edited edge metadata.Jul 8 2016, 9:02 AM

Since the parsing style is specified on the command line at startup, can a flag be attached to MCAsmInfo which specifies whether prefix-bool parsing is activated? It looks like there are similar things already put there such as CommentString and DollarIsPC

ygao added a subscriber: majnemer.Jul 8 2016, 2:01 PM

Since the parsing style is specified on the command line at startup, can a
flag be attached to MCAsmInfo which specifies whether prefix-bool parsing is
activated? It looks like there are similar things already put there such as
CommentString and DollarIsPC

I did not find a command-line option to control the parsing style. I found a
"-x86-asm-syntax" option which controls only the output assembly dialect. Do
you have something specific in mind?

I think what I can do is to add a flag to MCAsmInfo, and then flip the value
of the flag upon seeing either the ".intel_syntax" or ".att_syntax" directive.
What do you think that the flag should control?
Option#1: whether the "0b" prefix should be supported.

Current behavior:
  "0b00" is binary;
  "0bed" is syntax error;
  "0b00h" is syntax error;
With "0b" prefix disabled, all three above are hexadecimal.

Option#2: whether the "h" suffix should be supported.
Option#3: when there is both the "0b" prefix and "h" suffix, which one wins.

David asked me what is the current behavior with "0777h". So,

"0777"  => octal number (decimal=511)
"0777h" => hexadecimal (decimal=1911)
"0x77h" => hexadecimal

So it appears that the "h" suffix trumps everything else at the moment.

The Option#1 above does not impact the octal numbers, but if we implement
option#2 and when the "h" suffix is disabled, "0777h" and "0x77h" would be
syntax errors.

colinl added a comment.Jul 8 2016, 2:19 PM

It looks like in X86MCAsmInfo.cpp AsmWriterFlavor controls AssemblerDialect which is tied with [Intel|ATT]AsmParserVariant in X86.td

My gut feel is allowing both suffix and prefix to coexist with one winning out in certain circumstances is error prone in addition to expanding the accepted syntax beyond what other assemblers accept.

It seems like the MCAsmInfo flag would control whether the parser would accept either the prefix or postfix radix specifications though not both and when the Intel/ATT syntax variant is chosen it would pick the appropriate radix parsing function.

What do you think?

ygao added a comment.Jul 11 2016, 11:25 AM

I think what you said makes a lot of sense to me... The practical difficulty
here is that if I actually disabled "0b" prefix under the Intel syntax, I would
be without a way to express binary numbers (sad face). Maybe I can implement
something? The Intel instruction manual talks of using [01]+[bB], but it looks
sufficiently similar to a backward label, and I could not figure out how the
Intel assembler actually tells them apart. For example,

.text
  1:  add rax, 20
.data
  .long   1b

In this example, it would appear ambiguous to me whether the "1b" in data
section is a numerical literal "1" or the address of the add instruction.

ygao updated this revision to Diff 66593.Aug 2 2016, 4:38 PM
ygao added reviewers: rnk, hjl.tools.

Hi, I am updating the patch based on the latest feedback on PR27884.
This patch attempts to distinguish between the case of parsing MS inline asm blocks and that of parsing GNU inline asm. And in the former case, it implements the MASM-flavor intel-assembly parsing. I am adding a flag to the AsmLexer class.
There are a handful of MS inline asm test in clang that I need to modify for this patch. This review will be committed in one LLVM patch and one clang patch.

ygao added a comment.Aug 8 2016, 12:15 PM

A gentle ping.

colinl added inline comments.Aug 8 2016, 12:24 PM
lib/MC/MCParser/AsmLexer.cpp
261 ↗(On Diff #66593)

Will CurPtr ever be at the beginning of the buffer, making index -1 invalid?

tools/clang/test/CodeGenCXX/ms-inline-asm-return.cpp
88 ↗(On Diff #66593)

What made these switch from printing hex to decimal?

ygao added inline comments.Aug 15 2016, 8:07 PM
lib/MC/MCParser/AsmLexer.cpp
261 ↗(On Diff #66593)

I think, that is not possible.

The only call site of LexDigit() is in AsmLexer::LexToken() in the same file.
LexToken() calls getNextChar() to advance CurPtr before calling LexDigit(), so
it is known that CurPtr[-1] will be [0-9].

AsmToken AsmLexer::LexToken() {
  TokStart = CurPtr;
  int CurChar = getNextChar(); // basically "CurChar = *CurPtr++;"
  ...
  switch (CurChar) {
    ...
    case [0-9]:
      return LexDigit(); // inside LexDigt(), CurPtr[-1] is "CurChar" here
    ...
  }
}

The original codes in this function, located several lines below, also checks "if (CurPtr[-1] != '0' ...)".

tools/clang/test/CodeGenCXX/ms-inline-asm-return.cpp
88 ↗(On Diff #66593)

I was curious about it my own self...

X86AsmParser::ParseIntelOperand() makes this decision by comparing the following
two sizes: (search for a comment "rewrite the complex expression as a single immediate" to
locate the codes)

  1. size of the token, which is "next token position - current token position". e.g., given "0b0101U", the size would be 7.
  2. size of the string passed as the first argument to the constructor of intToken(). It is the "Result" string used in several places of the LexDigit() function. e.g., given "0b0101U", the "Result" string would be 6; the "U" suffix is not counted.

If these two sizes are equal, the original expression is printed, otherwise the
expression is rewritten as a decimal integer. In this case, "01010101h" will get rewritten with or
without my changes, because the two sizes are 9 vs 8; on the other hand, my
changes disallow "0x" prefix for MS-Intel inline assembly.

ygao added a comment.Aug 23 2016, 4:33 PM

A gentle ping.

There's probably not a really clean way to do this with our one-lexer-rules-them-all model for llvm-mc but I think if we can keep hex output where it was before we at least didn't make things worse.

tools/clang/test/CodeGenCXX/ms-inline-asm-return.cpp
88 ↗(On Diff #66593)

It looks like this is under a call to isParsingInlineAsm() which will has the prefix disallowed; does the AOK_ImmPrefix rewrite ever get used at this point? It seems like we need a AOK_ImmSuffix rewrite and place it here which should return it to hex output but in the correct format for MS inline asm.

ygao updated this revision to Diff 69462.Aug 26 2016, 7:48 PM
ygao marked 2 inline comments as done.
ygao marked an inline comment as done.Aug 26 2016, 7:54 PM
ygao added inline comments.
tools/clang/test/CodeGenCXX/ms-inline-asm-return.cpp
88 ↗(On Diff #69462)

AOK_ImmPrefix is used in this case for a decimal integer without ignorable suffix (U or L), but in that case rewriting does not make a noticeable difference. I modified this test slightly to make sure we still strip out the ignorable suffix.
While I do not think rewriting a hex number into decimal should really be considered wrong, it is fairly straightforward to get the hex output back by modifying LexDigit() to consider the [hHbB] suffix as part of the number (in the same spirit that the 0[xb] prefix is considered part of the number with the AT&T syntax).

rnk requested changes to this revision.Aug 30 2016, 8:47 AM
rnk edited edge metadata.
rnk added inline comments.
lib/MC/MCParser/AsmLexer.cpp
354 ↗(On Diff #69462)

This doesn't seem correct, MSVC accepts this code and returns 16:

int main() {
  __asm mov eax, 0x10
}
tools/clang/test/CodeGen/ms-inline-asm.c
43 ↗(On Diff #69462)

We should test both the 0xNN and NNh variants. Leave some of the 0x tests in place for now.

This revision now requires changes to proceed.Aug 30 2016, 8:47 AM
ygao updated this revision to Diff 69938.Aug 31 2016, 6:19 PM
ygao edited edge metadata.
ygao marked an inline comment as done.
ygao added inline comments.
lib/MC/MCParser/AsmLexer.cpp
336 ↗(On Diff #69938)

You are right.
Sigh. I was writing .asm files and testing them with the ml.exe/ml64.exe
executables which are part of Visual Studio 2013.

C:\> type test.asm
  .code
    mov ax, 0x0b00
  END
C:\> ml64 /FoMyObj test.asm
  test.asm(2) : error A2206:missing operator in expression

Apparently, the stand-alone assembler behaves differently than the parser in cl.exe.

I tested inline assembly again with cl.exe and updated the test cases accordingly.

0xNN => accepted
0xNN with U or L suffix => accepted
NNh => accepted
NNh with U or L suffix => accepted
0xNNh => rejected
0bNN => rejected
NNb => accepted
NNb with U or L suffix => accepted

rnk added a comment.Sep 1 2016, 7:57 AM

Pretty close

lib/MC/MCParser/AsmLexer.cpp
274 ↗(On Diff #69938)

These 'if' blocks should select the base, and then you can factor out the bodies, which are the same except for being in base16 and base2 and having a different diagnostic.

ygao updated this revision to Diff 70047.Sep 1 2016, 12:13 PM
ygao edited edge metadata.
ygao marked an inline comment as done.
ygao added inline comments.
lib/MC/MCParser/AsmLexer.cpp
274 ↗(On Diff #70047)

Fixed. Hopefully.

rnk accepted this revision.Sep 1 2016, 12:31 PM
rnk edited edge metadata.

lgtm, thanks!

lib/MC/MCParser/AsmLexer.cpp
285–286 ↗(On Diff #70047)

We usually do } else if (...) {

This revision is now accepted and ready to land.Sep 1 2016, 12:31 PM
This revision was automatically updated to reflect the committed changes.
ygao marked an inline comment as done.