Details

Reviewers

sidneym
colinl
hjl.tools
rnk

Commits

rGf4903a3675a9: (clang part) Implement MASM-flavor intel syntax behavior for inline MS asm…
rG27ea29b3b749: (LLVM part) Implement MASM-flavor intel syntax behavior for inline MS asm block…
rC280556: (clang part) Implement MASM-flavor intel syntax behavior for inline MS asm…
rL280556: (clang part) Implement MASM-flavor intel syntax behavior for inline MS asm…
rL280555: (LLVM part) Implement MASM-flavor intel syntax behavior for inline MS asm block:

Summary

Fixing PR27884: allow llvm-mc to accept numbers like 0BEDh as a valid hexadecimal number.

Does this look like a reasonable approach?

Diff Detail

Repository: rL LLVM

Event Timeline

ygao updated this revision to Diff 63136.Jul 7 2016, 2:29 PM

ygao retitled this revision from to Disambiguate a constant with both 0B prefix and H suffix..

ygao updated this object.

ygao added reviewers: sidneym, colinl.

ygao added a subscriber: llvm-commits.

Since the parsing style is specified on the command line at startup, can a flag be attached to MCAsmInfo which specifies whether prefix-bool parsing is activated? It looks like there are similar things already put there such as CommentString and DollarIsPC

I did not find a command-line option to control the parsing style. I found a
"-x86-asm-syntax" option which controls only the output assembly dialect. Do
you have something specific in mind?

I think what I can do is to add a flag to MCAsmInfo, and then flip the value
of the flag upon seeing either the ".intel_syntax" or ".att_syntax" directive.
What do you think that the flag should control?
Option#1: whether the "0b" prefix should be supported.

Current behavior:
  "0b00" is binary;
  "0bed" is syntax error;
  "0b00h" is syntax error;
With "0b" prefix disabled, all three above are hexadecimal.

Option#2: whether the "h" suffix should be supported.
Option#3: when there is both the "0b" prefix and "h" suffix, which one wins.

David asked me what is the current behavior with "0777h". So,

"0777"  => octal number (decimal=511)
"0777h" => hexadecimal (decimal=1911)
"0x77h" => hexadecimal

So it appears that the "h" suffix trumps everything else at the moment.

The Option#1 above does not impact the octal numbers, but if we implement
option#2 and when the "h" suffix is disabled, "0777h" and "0x77h" would be
syntax errors.
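To make the cases above concrete, here is a minimal sketch of one possible resolution of Option#3, in which the "h" suffix wins over any prefix. It is a standalone illustration, not the AsmLexer code; the names parseDigits and parseSuffixWins are hypothetical, and overflow is deliberately ignored.

```cpp
#include <cstdint>
#include <optional>
#include <string_view>

// Hypothetical helper: parse S as an unsigned number in the given radix.
// Returns nullopt for an empty string or a digit that is invalid in that radix.
static std::optional<uint64_t> parseDigits(std::string_view S, unsigned Radix) {
  if (S.empty())
    return std::nullopt;
  uint64_t Value = 0;
  for (char C : S) {
    unsigned D;
    if (C >= '0' && C <= '9')
      D = C - '0';
    else if (C >= 'a' && C <= 'f')
      D = C - 'a' + 10;
    else if (C >= 'A' && C <= 'F')
      D = C - 'A' + 10;
    else
      return std::nullopt;
    if (D >= Radix)
      return std::nullopt;
    Value = Value * Radix + D; // overflow is not checked in this sketch
  }
  return Value;
}

// Hypothetical classifier: an h/H suffix takes priority over 0b/0x prefixes.
std::optional<uint64_t> parseSuffixWins(std::string_view S) {
  if (S.empty())
    return std::nullopt;
  bool HasHexPrefix =
      S.size() >= 2 && S[0] == '0' && (S[1] == 'x' || S[1] == 'X');
  bool HasBinPrefix =
      S.size() >= 2 && S[0] == '0' && (S[1] == 'b' || S[1] == 'B');
  if (S.back() == 'h' || S.back() == 'H') {
    S.remove_suffix(1);
    if (HasHexPrefix)
      S.remove_prefix(2); // "0x77h" is read as hex 77
    return parseDigits(S, 16); // "0BEDh" is read as hex 0BED
  }
  if (HasHexPrefix)
    return parseDigits(S.substr(2), 16);
  if (HasBinPrefix)
    return parseDigits(S.substr(2), 2); // "0bed" fails: 'e' is not binary
  if (S[0] == '0' && S.size() > 1)
    return parseDigits(S.substr(1), 8);
  return parseDigits(S, 10);
}
```

Under this rule "0b00h" becomes hexadecimal rather than a syntax error, which is exactly the behavior change the options above are weighing.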

It looks like, in X86MCAsmInfo.cpp, AsmWriterFlavor controls AssemblerDialect, which is tied to [Intel|ATT]AsmParserVariant in X86.td.

My gut feeling is that allowing both suffix and prefix to coexist, with one winning out in certain circumstances, is error prone, in addition to expanding the accepted syntax beyond what other assemblers accept.

It seems like the MCAsmInfo flag would control whether the parser accepts either the prefix or the postfix radix specification, though not both; when the Intel/ATT syntax variant is chosen, it would pick the appropriate radix parsing function.

What do you think?

I think what you said makes a lot of sense to me... The practical difficulty
here is that if I actually disabled "0b" prefix under the Intel syntax, I would
be without a way to express binary numbers (sad face). Maybe I can implement
something? The Intel instruction manual talks of using [01]+[bB], but it looks
sufficiently similar to a backward label, and I could not figure out how the
Intel assembler actually tells them apart. For example,

.text
  1:  add rax, 20
.data
  .long   1b

In this example, it would appear ambiguous to me whether the "1b" in data
section is a numerical literal "1" or the address of the add instruction.
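The overlap can be sketched with two small predicates. This is only an illustration of the two patterns as written in this thread ([0-9]b for a backward label, following the LexDigit() comment, and [01]+[bB] for a MASM-style binary literal); the function names are hypothetical and nothing here reflects how the Intel assembler actually decides.

```cpp
#include <cctype>
#include <string_view>

// Backward label reference as described in the LexDigit() comment: [0-9]b
bool looksLikeBackwardLabel(std::string_view S) {
  return S.size() == 2 && std::isdigit(static_cast<unsigned char>(S[0])) &&
         S[1] == 'b';
}

// MASM-style binary literal: [01]+[bB]
bool looksLikeMasmBinary(std::string_view S) {
  if (S.size() < 2)
    return false;
  char Last = S.back();
  if (Last != 'b' && Last != 'B')
    return false;
  for (char C : S.substr(0, S.size() - 1))
    if (C != '0' && C != '1')
      return false;
  return true;
}

// A token is ambiguous when both readings match it.
bool isAmbiguous(std::string_view S) {
  return looksLikeBackwardLabel(S) && looksLikeMasmBinary(S);
}
```

With these definitions only "0b" and "1b" are ambiguous; "2b" can only be a label and "10b" can only be a binary literal.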

Hi, I am updating the patch based on the latest feedback on PR27884.
This patch attempts to distinguish between the case of parsing MS inline asm blocks and that of parsing GNU inline asm. And in the former case, it implements the MASM-flavor intel-assembly parsing. I am adding a flag to the AsmLexer class.
There are a handful of MS inline asm tests in clang that I need to modify for this patch. This review will be committed in one LLVM patch and one clang patch.

A gentle ping.

colinl added inline comments.Aug 8 2016, 12:24 PM

lib/MC/MCParser/AsmLexer.cpp
261 ↗	(On Diff #66593)	Will CurPtr ever be at the beginning of the buffer, making index -1 invalid?
tools/clang/test/CodeGenCXX/ms-inline-asm-return.cpp
88 ↗	(On Diff #66593)	What made these switch from printing hex to decimal?

ygao added inline comments.Aug 15 2016, 8:07 PM

lib/MC/MCParser/AsmLexer.cpp
261 ↗	(On Diff #66593)	I think that is not possible. The only call site of LexDigit() is in AsmLexer::LexToken() in the same file. LexToken() calls getNextChar() to advance CurPtr before calling LexDigit(), so it is known that CurPtr[-1] will be [0-9].
  AsmToken AsmLexer::LexToken() {
    TokStart = CurPtr;
    int CurChar = getNextChar(); // basically "CurChar = *CurPtr++;"
    ...
    switch (CurChar) {
    ...
    case [0-9]:
      return LexDigit(); // inside LexDigit(), CurPtr[-1] is "CurChar" here
    ...
    }
  }
The original code in this function, located several lines below, also checks "if (CurPtr[-1] != '0' ...)".
tools/clang/test/CodeGenCXX/ms-inline-asm-return.cpp
88 ↗	(On Diff #66593)	I was curious about that myself... X86AsmParser::ParseIntelOperand() makes this decision by comparing the following two sizes (search for the comment "rewrite the complex expression as a single immediate" to locate the code):
  - The size of the token, which is "next token position - current token position". E.g., given "0b0101U", the size would be 7.
  - The size of the string passed as the first argument to the constructor of intToken(). It is the "Result" string used in several places of the LexDigit() function. E.g., given "0b0101U", the "Result" string would have size 6; the "U" suffix is not counted.
If these two sizes are equal, the original expression is printed; otherwise the expression is rewritten as a decimal integer. In this case, "01010101h" will get rewritten with or without my changes, because the two sizes are 9 vs 8; on the other hand, my changes disallow the "0x" prefix for MS-Intel inline assembly.
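The size comparison described in this comment can be sketched as follows. The struct and function names are hypothetical stand-ins, not the types used by X86AsmParser; the sketch only models the rule "equal sizes keep the original text, unequal sizes trigger a rewrite".

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical stand-in for what is known about one integer literal.
struct LexedLiteral {
  std::string_view SourceSpan; // text up to the next token, e.g. "0b0101U"
  std::string_view Result;     // string handed to intToken(), e.g. "0b0101"
};

// The operand is printed as written only when both sizes agree;
// otherwise it is rewritten as a decimal immediate.
bool printedUnchanged(const LexedLiteral &L) {
  return L.SourceSpan.size() == L.Result.size();
}
```

For "0b0101U" the sizes are 7 and 6, and for "01010101h" (as lexed in Diff #66593) they are 9 and 8, so both are rewritten; a plain "42" has equal sizes and is left alone.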

A gentle ping.

There's probably not a really clean way to do this with our one-lexer-rules-them-all model for llvm-mc, but I think if we can keep hex output where it was before, we at least won't make things worse.

tools/clang/test/CodeGenCXX/ms-inline-asm-return.cpp
88 ↗	(On Diff #66593)	It looks like this is under a call to isParsingInlineAsm(), which will have the prefix disallowed; does the AOK_ImmPrefix rewrite ever get used at this point? It seems like we need an AOK_ImmSuffix rewrite placed here, which should return it to hex output but in the correct format for MS inline asm.

ygao updated this revision to Diff 69462.Aug 26 2016, 7:48 PM

ygao marked 2 inline comments as done.

ygao marked an inline comment as done.Aug 26 2016, 7:54 PM

ygao added inline comments.

tools/clang/test/CodeGenCXX/ms-inline-asm-return.cpp
88 ↗	(On Diff #69462)	AOK_ImmPrefix is used in this case for a decimal integer without ignorable suffix (U or L), but in that case rewriting does not make a noticeable difference. I modified this test slightly to make sure we still strip out the ignorable suffix. While I do not think rewriting a hex number into decimal should really be considered wrong, it is fairly straightforward to get the hex output back by modifying LexDigit() to consider the [hHbB] suffix as part of the number (in the same spirit that the 0[xb] prefix is considered part of the number with the AT&T syntax).

LGTM

rnk requested changes to this revision.Aug 30 2016, 8:47 AM

rnk edited edge metadata.

rnk added inline comments.

lib/MC/MCParser/AsmLexer.cpp
354 ↗	(On Diff #69462)	This doesn't seem correct, MSVC accepts this code and returns 16: int main() { __asm mov eax, 0x10 }
tools/clang/test/CodeGen/ms-inline-asm.c
43 ↗	(On Diff #69462)	We should test both the 0xNN and NNh variants. Leave some of the 0x tests in place for now.

This revision now requires changes to proceed.Aug 30 2016, 8:47 AM

ygao updated this revision to Diff 69938.Aug 31 2016, 6:19 PM

ygao edited edge metadata.

ygao marked an inline comment as done.

ygao added inline comments.

lib/MC/MCParser/AsmLexer.cpp
336 ↗	(On Diff #69938)	You are right. Sigh. I was writing .asm files and testing them with the ml.exe/ml64.exe executables which are part of Visual Studio 2013.
  C:\> type test.asm
  .code
  mov ax, 0x0b00
  END
  C:\> ml64 /FoMyObj test.asm
  test.asm(2) : error A2206:missing operator in expression
Apparently, the stand-alone assembler behaves differently than the parser in cl.exe. I tested inline assembly again with cl.exe and updated the test cases accordingly.
  0xNN                    => accepted
  0xNN with U or L suffix => accepted
  NNh                     => accepted
  NNh with U or L suffix  => accepted
  0xNNh                   => rejected
  0bNN                    => rejected
  NNb                     => accepted
  NNb with U or L suffix  => accepted
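The acceptance table above can be expressed as a small standalone classifier. This is a sketch under stated assumptions, not the LexDigit() implementation: the names are hypothetical, any trailing run of U/L characters is treated as an ignorable suffix (a simplification), and the octal/decimal fallback is an assumption that the table does not cover.

```cpp
#include <cstdint>
#include <optional>
#include <string_view>

// Hypothetical helper: parse S in the given radix, or fail on a bad digit.
static std::optional<uint64_t> parseDigits(std::string_view S, unsigned Radix) {
  if (S.empty())
    return std::nullopt;
  uint64_t Value = 0;
  for (char C : S) {
    unsigned D;
    if (C >= '0' && C <= '9')
      D = C - '0';
    else if (C >= 'a' && C <= 'f')
      D = C - 'a' + 10;
    else if (C >= 'A' && C <= 'F')
      D = C - 'A' + 10;
    else
      return std::nullopt;
    if (D >= Radix)
      return std::nullopt;
    Value = Value * Radix + D; // overflow is not checked in this sketch
  }
  return Value;
}

// Hypothetical classifier following the cl.exe results listed above.
std::optional<uint64_t> parseMSInlineLiteral(std::string_view S) {
  // Simplification: drop any trailing U/L type-suffix characters.
  while (!S.empty() && (S.back() == 'u' || S.back() == 'U' ||
                        S.back() == 'l' || S.back() == 'L'))
    S.remove_suffix(1);
  if (S.empty() || S[0] < '0' || S[0] > '9')
    return std::nullopt;
  // 0xNN is accepted; 0xNNh is rejected because 'h' is not a hex digit.
  if (S.size() >= 2 && S[0] == '0' && (S[1] == 'x' || S[1] == 'X'))
    return parseDigits(S.substr(2), 16);
  char Last = S.back();
  if (Last == 'h' || Last == 'H') // NNh
    return parseDigits(S.substr(0, S.size() - 1), 16);
  if (Last == 'b' || Last == 'B') // NNb
    return parseDigits(S.substr(0, S.size() - 1), 2);
  // Assumption: everything else is octal or decimal, so 0bNN is rejected.
  if (S[0] == '0' && S.size() > 1)
    return parseDigits(S.substr(1), 8);
  return parseDigits(S, 10);
}
```

Checking the "0x" prefix before the suffixes is what makes "0x1b" a hex literal rather than a malformed binary one.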

Pretty close

lib/MC/MCParser/AsmLexer.cpp
274 ↗	(On Diff #69938)	These 'if' blocks should select the base, and then you can factor out the bodies, which are the same except for being in base16 and base2 and having a different diagnostic.

ygao updated this revision to Diff 70047.Sep 1 2016, 12:13 PM

ygao edited edge metadata.

ygao marked an inline comment as done.

ygao added inline comments.

lib/MC/MCParser/AsmLexer.cpp
274 ↗	(On Diff #70047)	Fixed. Hopefully.

lgtm, thanks!

lib/MC/MCParser/AsmLexer.cpp
285–286 ↗	(On Diff #70047)	We usually do `} else if (...) {`

This revision is now accepted and ready to land.Sep 1 2016, 12:31 PM

Closed by commit rL280555: (LLVM part) Implement MASM-flavor intel syntax behavior for inline MS asm block: (authored by ygao).Sep 2 2016, 4:23 PM

This revision was automatically updated to reflect the committed changes.

ygao marked an inline comment as done.

Diff 70254

llvm/trunk/include/llvm/MC/MCParser/AsmLexer.h

(25 lines not shown)
 /// AsmLexer - Lexer class for assembly files.
 class AsmLexer : public MCAsmLexer {
   const MCAsmInfo &MAI;

   const char *CurPtr;
   StringRef CurBuf;
   bool IsAtStartOfLine;
   bool IsAtStartOfStatement;
+  bool IsParsingMSInlineAsm;

   void operator=(const AsmLexer&) = delete;
   AsmLexer(const AsmLexer&) = delete;

 protected:
   /// LexToken - Read the next token and return its code.
   AsmToken LexToken() override;

 public:
   AsmLexer(const MCAsmInfo &MAI);
   ~AsmLexer() override;

   void setBuffer(StringRef Buf, const char *ptr = nullptr);
+  void setParsingMSInlineAsm(bool V) { IsParsingMSInlineAsm = V; }

   StringRef LexUntilEndOfStatement() override;

   size_t peekTokens(MutableArrayRef<AsmToken> Buf,
                     bool ShouldSkipSpace = true) override;

   const MCAsmInfo &getMAI() const { return MAI; }

(21 lines not shown)

llvm/trunk/lib/MC/MCParser/AsmLexer.cpp

(27 lines not shown)
 #include <utility>

 using namespace llvm;

 AsmLexer::AsmLexer(const MCAsmInfo &MAI) : MAI(MAI) {
   CurPtr = nullptr;
   IsAtStartOfLine = true;
   IsAtStartOfStatement = true;
+  IsParsingMSInlineAsm = false;
   AllowAtInIdentifier = !StringRef(MAI.getCommentString()).startswith("@");
 }

 AsmLexer::~AsmLexer() {
 }

 void AsmLexer::setBuffer(StringRef Buf, const char *ptr) {
   CurBuf = Buf;
(215 lines not shown)
 /// LexDigit: First character is [0-9].
 ///   Local Label: [0-9][:]
 ///   Forward/Backward Label: [0-9][fb]
 ///   Binary integer: 0b[01]+
 ///   Octal integer: 0[0-7]+
 ///   Hex integer: 0x[0-9a-fA-F]+ or [0x]?[0-9][0-9a-fA-F]*[hH]
 ///   Decimal integer: [1-9][0-9]*
 AsmToken AsmLexer::LexDigit() {
+  // MASM-flavor binary integer: [01]+[bB]
+  // MASM-flavor hexadecimal integer: [0-9][0-9a-fA-F]*[hH]
+  if (IsParsingMSInlineAsm && isdigit(CurPtr[-1])) {
+    const char *FirstNonBinary = (CurPtr[-1] != '0' && CurPtr[-1] != '1') ?
+                                   CurPtr - 1 : nullptr;
+    const char *OldCurPtr = CurPtr;
+    while (isxdigit(*CurPtr)) {
+      if (*CurPtr != '0' && *CurPtr != '1' && !FirstNonBinary)
+        FirstNonBinary = CurPtr;
+      ++CurPtr;
+    }
+
+    unsigned Radix = 0;
+    if (*CurPtr == 'h' || *CurPtr == 'H') {
+      // hexadecimal number
+      ++CurPtr;
+      Radix = 16;
+    } else if (FirstNonBinary && FirstNonBinary + 1 == CurPtr &&
+               (*FirstNonBinary == 'b' || *FirstNonBinary == 'B'))
+      Radix = 2;
+
+    if (Radix == 2 || Radix == 16) {
+      StringRef Result(TokStart, CurPtr - TokStart);
+      APInt Value(128, 0, true);
+
+      if (Result.drop_back().getAsInteger(Radix, Value))
+        return ReturnError(TokStart, Radix == 2 ? "invalid binary number" :
+                             "invalid hexdecimal number");
+
+      // MSVC accepts and ignores type suffices on integer literals.
+      SkipIgnoredIntegerSuffix(CurPtr);
+
+      return intToken(Result, Value);
+    }
+
+    // octal/decimal integers, or floating point numbers, fall through
+    CurPtr = OldCurPtr;
+  }
+
   // Decimal integer: [1-9][0-9]*
   if (CurPtr[-1] != '0' || CurPtr[0] == '.') {
     unsigned Radix = doLookAhead(CurPtr, 10);
     bool isHex = Radix == 16;
     // Check for floating point literals.
     if (!isHex && (*CurPtr == '.' || *CurPtr == 'e')) {
       ++CurPtr;
       return LexFloatLiteral();
(12 lines not shown; enclosing scope: if (CurPtr[-1] != '0' || CurPtr[0] == '.') {)

     // The darwin/x86 (and x86-64) assembler accepts and ignores type
     // suffices on integer literals.
     SkipIgnoredIntegerSuffix(CurPtr);

     return intToken(Result, Value);
   }

-  if ((*CurPtr == 'b') || (*CurPtr == 'B')) {
+  if (!IsParsingMSInlineAsm && ((*CurPtr == 'b') || (*CurPtr == 'B'))) {
     ++CurPtr;
     // See if we actually have "0b" as part of something like "jmp 0b\n"
     if (!isdigit(CurPtr[0])) {
       --CurPtr;
       StringRef Result(TokStart, CurPtr - TokStart);
       return AsmToken(AsmToken::Integer, Result, 0);
     }
     const char *NumStart = CurPtr;
(32 lines not shown; enclosing scope: if ((*CurPtr == 'x') || (*CurPtr == 'X')) {)
     if (CurPtr == NumStart)
       return ReturnError(CurPtr-2, "invalid hexadecimal number");

     APInt Result(128, 0);
     if (StringRef(TokStart, CurPtr - TokStart).getAsInteger(0, Result))
       return ReturnError(TokStart, "invalid hexadecimal number");

     // Consume the optional [hH].
-    if (*CurPtr == 'h' || *CurPtr == 'H')
+    if (!IsParsingMSInlineAsm && (*CurPtr == 'h' || *CurPtr == 'H'))
       ++CurPtr;

     // The darwin/x86 (and x86-64) assembler accepts and ignores ULL and LL
     // suffixes on integer literals.
     SkipIgnoredIntegerSuffix(CurPtr);

     return intToken(StringRef(TokStart, CurPtr - TokStart), Result);
   }
(339 lines not shown)

llvm/trunk/lib/MC/MCParser/AsmParser.cpp

(250 lines not shown; context: void Note(SMLoc L, const Twine &Msg,)
             ArrayRef<SMRange> Ranges = None) override;
   bool Warning(SMLoc L, const Twine &Msg,
                ArrayRef<SMRange> Ranges = None) override;
   bool Error(SMLoc L, const Twine &Msg,
              ArrayRef<SMRange> Ranges = None) override;

   const AsmToken &Lex() override;

-  void setParsingInlineAsm(bool V) override { ParsingInlineAsm = V; }
+  void setParsingInlineAsm(bool V) override {
+    ParsingInlineAsm = V;
+    Lexer.setParsingMSInlineAsm(V);
+  }
   bool isParsingInlineAsm() override { return ParsingInlineAsm; }

   bool parseMSInlineAsm(void *AsmLoc, std::string &AsmString,
                         unsigned &NumOutputs, unsigned &NumInputs,
                         SmallVectorImpl<std::pair<void *,bool> > &OpDecls,
                         SmallVectorImpl<std::string> &Constraints,
                         SmallVectorImpl<std::string> &Clobbers,
                         const MCInstrInfo *MII, const MCInstPrinter *IP,
(5,046 lines not shown)

This is an archive of the discontinued LLVM Phabricator instance.

Disambiguate a constant with both 0B prefix and H suffix.
ClosedPublic
