Download Raw Diff

Details

Reviewers

sidneym
colinl
hjl.tools
rnk

Commits

rGf4903a3675a9: (clang part) Implement MASM-flavor intel syntax behavior for inline MS asm…
rG27ea29b3b749: (LLVM part) Implement MASM-flavor intel syntax behavior for inline MS asm block…
rC280556: (clang part) Implement MASM-flavor intel syntax behavior for inline MS asm…
rL280556: (clang part) Implement MASM-flavor intel syntax behavior for inline MS asm…
rL280555: (LLVM part) Implement MASM-flavor intel syntax behavior for inline MS asm block:

Summary

Fixing PR27884: allow llvm-mc to accept numbers like 0BEDh as a valid hexadecimal number.

Does this look like a reasonable approach?

Diff Detail

Event Timeline

ygao updated this revision to Diff 63136.Jul 7 2016, 2:29 PM

ygao retitled this revision from to Disambiguate a constant with both 0B prefix and H suffix..

ygao updated this object.

ygao added reviewers: sidneym, colinl.

ygao added a subscriber: llvm-commits.

Since the parsing style is specified on the command line at startup, can a flag be attached to MCAsmInfo which specifies whether prefix-bool parsing is activated? It looks like there are similar things already put there such as CommentString and DollarIsPC

Since the parsing style is specified on the command line at startup, can a
flag be attached to MCAsmInfo which specifies whether prefix-bool parsing is
activated? It looks like there are similar things already put there such as
CommentString and DollarIsPC

I did not find a command-line option to control the parsing style. I found a
"-x86-asm-syntax" option which controls only the output assembly dialect. Do
you have something specific in mind?

I think what I can do is to add a flag to MCAsmInfo, and then flip the value
of the flag upon seeing either the ".intel_syntax" or ".att_syntax" directive.
What do you think that the flag should control?
Option#1: whether the "0b" prefix should be supported.

Current behavior:
  "0b00" is binary;
  "0bed" is syntax error;
  "0b00h" is syntax error;
With "0b" prefix disabled, all three above are hexadecimal.

Option#2: whether the "h" suffix should be supported.
Option#3: when there is both the "0b" prefix and "h" suffix, which one wins.

David asked me what is the current behavior with "0777h". So,

"0777"  => octal number (decimal=511)
"0777h" => hexadecimal (decimal=1911)
"0x77h" => hexadecimal

So it appears that the "h" suffix trumps everything else at the moment.

The Option#1 above does not impact the octal numbers, but if we implement
option#2 and when the "h" suffix is disabled, "0777h" and "0x77h" would be
syntax errors.

It looks like in X86MCAsmInfo.cpp AsmWriterFlavor controls AssemblerDialect which is tied with [Intel|ATT]AsmParserVariant in X86.td

My gut feel is allowing both suffix and prefix to coexist with one winning out in certain circumstances is error prone in addition to expanding the accepted syntax beyond what other assemblers accept.

It seems like the MCAsmInfo flag would control whether the parser would accept either the prefix or postfix radix specifications though not both and when the Intel/ATT syntax variant is chosen it would pick the appropriate radix parsing function.

What do you think?

I think what you said makes a lot of sense to me... The practical difficulty
here is that if I actually disabled "0b" prefix under the Intel syntax, I would
be without a way to express binary numbers (sad face). Maybe I can implement
something? The Intel instruction manual talks of using [01]+[bB], but it looks
sufficiently similar to a backward label, and I could not figure out how the
Intel assembler actually tells them apart. For example,

.text
  1:  add rax, 20
.data
  .long   1b

In this example, it would appear ambiguous to me whether the "1b" in data
section is a numerical literal "1" or the address of the add instruction.

Hi, I am updating the patch based on the latest feedback on PR27884.
This patch attempts to distinguish between the case of parsing MS inline asm blocks and that of parsing GNU inline asm. And in the former case, it implements the MASM-flavor intel-assembly parsing. I am adding a flag to the AsmLexer class.
There are a handful of MS inline asm test in clang that I need to modify for this patch. This review will be committed in one LLVM patch and one clang patch.

A gentle ping.

colinl added inline comments.Aug 8 2016, 12:24 PM

lib/MC/MCParser/AsmLexer.cpp
261	Will CurPtr ever be at the beginning of the buffer, making index -1 invalid?
tools/clang/test/CodeGenCXX/ms-inline-asm-return.cpp
88 ↗	(On Diff #66593)	What made these switch from printing hex to decimal?

ygao added inline comments.Aug 15 2016, 8:07 PM

lib/MC/MCParser/AsmLexer.cpp
261	I think, that is not possible. The only call site of LexDigit() is in AsmLexer::LexToken() in the same file. LexToken() calls getNextChar() to advance CurPtr before calling LexDigit(), so it is known that CurPtr[-1] will be [0-9]. AsmToken AsmLexer::LexToken() { TokStart = CurPtr; int CurChar = getNextChar(); // basically "CurChar = *CurPtr++;" ... switch (CurChar) { ... case [0-9]: return LexDigit(); // inside LexDigt(), CurPtr[-1] is "CurChar" here ... } } The original codes in this function, located several lines below, also checks "if (CurPtr[-1] != '0' ...)".
tools/clang/test/CodeGenCXX/ms-inline-asm-return.cpp
88 ↗	(On Diff #66593)	I was curious about it my own self... X86AsmParser::ParseIntelOperand() makes this decision by comparing the following two sizes: (search for a comment "rewrite the complex expression as a single immediate" to locate the codes) size of the token, which is "next token position - current token position". e.g., given "0b0101U", the size would be 7. size of the string passed as the first argument to the constructor of intToken(). It is the "Result" string used in several places of the LexDigit() function. e.g., given "0b0101U", the "Result" string would be 6; the "U" suffix is not counted. If these two sizes are equal, the original expression is printed, otherwise the expression is rewritten as a decimal integer. In this case, "01010101h" will get rewritten with or without my changes, because the two sizes are 9 vs 8; on the other hand, my changes disallow "0x" prefix for MS-Intel inline assembly.

A gentle ping.

There's probably not a really clean way to do this with our one-lexer-rules-them-all model for llvm-mc but I think if we can keep hex output where it was before we at least didn't make things worse.

tools/clang/test/CodeGenCXX/ms-inline-asm-return.cpp
88 ↗	(On Diff #66593)	It looks like this is under a call to isParsingInlineAsm() which will has the prefix disallowed; does the AOK_ImmPrefix rewrite ever get used at this point? It seems like we need a AOK_ImmSuffix rewrite and place it here which should return it to hex output but in the correct format for MS inline asm.

ygao updated this revision to Diff 69462.Aug 26 2016, 7:48 PM

ygao marked 2 inline comments as done.

ygao marked an inline comment as done.Aug 26 2016, 7:54 PM

ygao added inline comments.

tools/clang/test/CodeGenCXX/ms-inline-asm-return.cpp
88 ↗	(On Diff #69462)	AOK_ImmPrefix is used in this case for a decimal integer without ignorable suffix (U or L), but in that case rewriting does not make a noticeable difference. I modified this test slightly to make sure we still strip out the ignorable suffix. While I do not think rewriting a hex number into decimal should really be considered wrong, it is fairly straightforward to get the hex output back by modifying LexDigit() to consider the [hHbB] suffix as part of the number (in the same spirit that the 0[xb] prefix is considered part of the number with the AT&T syntax).

LGTM

rnk requested changes to this revision.Aug 30 2016, 8:47 AM

rnk edited edge metadata.

rnk added inline comments.

lib/MC/MCParser/AsmLexer.cpp
354	This doesn't seem correct, MSVC accepts this code and returns 16: int main() { __asm mov eax, 0x10 }
tools/clang/test/CodeGen/ms-inline-asm.c
43 ↗	(On Diff #69462)	We should test both the 0xNN and NNh variants. Leave some of the 0x tests in place for now.

This revision now requires changes to proceed.Aug 30 2016, 8:47 AM

ygao updated this revision to Diff 69938.Aug 31 2016, 6:19 PM

ygao edited edge metadata.

ygao marked an inline comment as done.

ygao added inline comments.

lib/MC/MCParser/AsmLexer.cpp
336	You are right. Sigh. I was writing .asm files and testing them with the ml.exe/ml64.exe executables which are part of Visual Studio 2013. C:\> type test.asm .code mov ax, 0x0b00 END C:\> ml64 /FoMyObj test.asm test.asm(2) : error A2206:missing operator in expression Apparently, the stand-alone assembler behaves differently than the parser in cl.exe. I tested inline assembly again with cl.exe and updated the test cases accordingly. 0xNN => accepted 0xNN with U or L suffix => accepted NNh => accepted NNh with U or L suffix => accepted 0xNNh => rejected 0bNN => rejected NNb => accepted NNb with U or L suffix => accepted

Pretty close

lib/MC/MCParser/AsmLexer.cpp
274	These 'if' blocks should select the base, and then you can factor out the bodies, which are the same except for being in base16 and base2 and having a different diagnostic.

ygao updated this revision to Diff 70047.Sep 1 2016, 12:13 PM

ygao edited edge metadata.

ygao marked an inline comment as done.

ygao added inline comments.

lib/MC/MCParser/AsmLexer.cpp
274	Fixed. Hopefully.

lgtm, thanks!

lib/MC/MCParser/AsmLexer.cpp
285–286	We usually do `} else if (...) {`

This revision is now accepted and ready to land.Sep 1 2016, 12:31 PM

Closed by commit rL280555: (LLVM part) Implement MASM-flavor intel syntax behavior for inline MS asm block: (authored by ygao). · Explain WhySep 2 2016, 4:23 PM

This revision was automatically updated to reflect the committed changes.

ygao marked an inline comment as done.

This is an archive of the discontinued LLVM Phabricator instance.

Disambiguate a constant with both 0B prefix and H suffix.
ClosedPublic

Details

Diff Detail

Event Timeline

Revision Contents

Diff 63136

lib/MC/MCParser/AsmLexer.cpp

test/MC/AsmParser/uppercase-hex.s

This is an archive of the discontinued LLVM Phabricator instance.

Disambiguate a constant with both 0B prefix and H suffix.ClosedPublic

Details

Diff Detail

Event Timeline

Revision Contents

Diff 63136

lib/MC/MCParser/AsmLexer.cpp

test/MC/AsmParser/uppercase-hex.s

Disambiguate a constant with both 0B prefix and H suffix.
ClosedPublic