This is an archive of the discontinued LLVM Phabricator instance.

[MC] AsmLexer: add extensible identifier's character set support.
Needs Revision · Public

Authored by vpykhtin on Feb 16 2016, 8:16 AM.

Details

Reviewers

grosbach
• tstellarAMD
arsenm

Summary

While working on the AMDGPU project I needed to support assembler identifiers that start with '&'. Looking at AsmLexer.cpp, I found a similar need to optionally allow '@' inside identifiers, so I decided it was time to add generic support for an identifier character set. I added configurable bit-vector sets for the characters allowed at the start (prefix) and in the body of an identifier.
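
The idea in the summary can be sketched outside LLVM roughly as follows. This is a minimal illustration, not the patch's code: `IdentifierCharSet`, `allowPrefix`, `allowBody` and `lexIdentifier` are hypothetical names, and `std::bitset<256>` stands in for the `SmallBitVector` members that the patch adds to `AsmLexer`. The default sets are assumed from the constructor in the diff; '@' is left out because the patch enables it only conditionally.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for the two character sets described in the summary.
// The patch itself keeps them as llvm::SmallBitVector members of AsmLexer.
class IdentifierCharSet {
  std::bitset<256> Prefix; // characters that may start an identifier
  std::bitset<256> Body;   // characters that may follow the first one

  static std::size_t index(char C) { return static_cast<unsigned char>(C); }

public:
  IdentifierCharSet() {
    // Defaults assumed from the patch: [a-zA-Z._] to start, plus [0-9$?] inside.
    for (char C = 'a'; C <= 'z'; ++C) Prefix.set(index(C));
    for (char C = 'A'; C <= 'Z'; ++C) Prefix.set(index(C));
    Prefix.set(index('.'));
    Prefix.set(index('_'));
    Body = Prefix;
    for (char C = '0'; C <= '9'; ++C) Body.set(index(C));
    Body.set(index('$'));
    Body.set(index('?'));
  }

  void allowPrefix(const std::string &Chars) {
    for (char C : Chars) Prefix.set(index(C));
  }
  void allowBody(const std::string &Chars) {
    for (char C : Chars) Body.set(index(C));
  }

  bool isPrefixChar(char C) const { return Prefix.test(index(C)); }
  bool isBodyChar(char C) const { return Body.test(index(C)); }
};

// Returns the identifier that starts at Text[Pos], or "" if none starts there.
std::string lexIdentifier(const IdentifierCharSet &Set, const std::string &Text,
                          std::size_t Pos) {
  assert(Pos <= Text.size() && "position out of range");
  if (Pos == Text.size() || !Set.isPrefixChar(Text[Pos]))
    return std::string();
  std::size_t End = Pos + 1;
  while (End < Text.size() && Set.isBodyChar(Text[End]))
    ++End;
  return Text.substr(Pos, End - Pos);
}
```

With a table like this, a target that wants '&' to start an identifier would make one call such as `Set.allowPrefix("&")`, and no per-character special cases are needed in the scanning loop.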

Diff Detail

Repository: rL LLVM

Event Timeline

vpykhtin updated this revision to Diff 48072. Feb 16 2016, 8:16 AM

vpykhtin retitled this revision from to [MC] AsmLexer: 30% speedup on tests, added extensible identifier's character set support..

vpykhtin updated this object.

vpykhtin added reviewers: grosbach, arsenm, • ddunbar.

vpykhtin set the repository for this revision to rL LLVM.

vpykhtin added a project: Restricted Project.

vpykhtin added a subscriber: nhaustov.

vpykhtin added a reviewer: • tstellarAMD. Feb 25 2016, 5:45 AM

• tstellarAMD added a subscriber: llvm-commits. Feb 25 2016, 2:41 PM

A kind reminder: could someone take a look at this?

• rafael added a subscriber: • rafael. Mar 1 2016, 11:12 AM

• rafael added inline comments.

include/llvm/MC/MCParser/MCAsmLexer.h
211	Why do you need to make these virtual?
216	is..Contains is a strange name since it has two verbs.
lib/MC/MCParser/AsmLexer.cpp
23	Why?

vpykhtin added inline comments. Mar 1 2016, 11:34 AM

include/llvm/MC/MCParser/MCAsmLexer.h
211	Well, not making it virtual would require the bitvector sets to be part of this class. I'm not objecting, though, as that is already done with SkipSpace and AllowAtInIdentifier.
216	What would be a better name here?
lib/MC/MCParser/AsmLexer.cpp
23	Well, it is based on my previous experience on Windows, where we had a lexer in which these routines ate up to 10% of scan time. Probably not as "generally" as I stated, though. I'm not insisting on this particular change and can remove it.
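
For reference, the hand-written checks under discussion can be sketched as below. This is a standalone illustration under stated assumptions: `agreesWithCctypeOnAscii` and `hexValue` are hypothetical helpers added only to show that the range checks match `<cctype>` on 7-bit ASCII. No timing claim is made here; whether the range checks are faster depends on the C library. One difference that does not depend on measurement is that passing a negative `char` to `std::isdigit` without a cast to `unsigned char` is undefined behaviour, while the range checks accept any `char`.

```cpp
#include <cassert>
#include <cctype>

// Locale-independent ASCII checks, in the spirit of the helpers in the patch.
inline bool isDigit(char C) { return C >= '0' && C <= '9'; }

inline bool isHexDigit(char C) {
  return isDigit(C) || (C >= 'A' && C <= 'F') || (C >= 'a' && C <= 'f');
}

// Hypothetical check: true if the helpers agree with <cctype> on 7-bit ASCII.
inline bool agreesWithCctypeOnAscii() {
  for (int I = 0; I < 128; ++I) {
    char C = static_cast<char>(I);
    if (isDigit(C) != (std::isdigit(I) != 0))
      return false;
    if (isHexDigit(C) != (std::isxdigit(I) != 0))
      return false;
  }
  return true;
}

// Hypothetical helper: numeric value of a hex digit (caller passes a hex digit).
inline unsigned hexValue(char C) {
  assert(isHexDigit(C) && "not a hexadecimal digit");
  if (isDigit(C))
    return static_cast<unsigned>(C - '0');
  // Setting bit 0x20 maps 'A'..'F' onto 'a'..'f' in ASCII.
  return static_cast<unsigned>((C | 0x20) - 'a' + 10);
}
```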

• ddunbar resigned from this revision. Sep 1 2016, 8:26 PM

• ddunbar removed a reviewer: • ddunbar.

After a long time, I would like to revive this review request.

Previously I measured the performance impact of this patch incorrectly and obtained a 30% performance gain; that result was wrong. A current measurement on a large .s file shows no effect on parsing performance.

Herald edited edge metadata. Nov 18 2016, 6:24 AM

Herald added subscribers: nhaehnle, wdng.

ping

ping.

Last ping?

grosbach requested changes to this revision. Dec 9 2016, 2:58 PM

grosbach edited edge metadata.

grosbach added inline comments.

include/llvm/MC/MCParser/MCAsmLexer.h
211	With the generalization, these can go away entirely, yes? Replace the callsites w/ the new API.
219	This should start with "is", not "Is", according to the coding guidelines.
224	Ditto.
231	This feels really weird. Wouldn't any callsites want to be using one of the other two? They'll know their context. I don't see any invocations of this method in the patch. Why is it needed at all?
lib/MC/MCParser/AsmLexer.cpp
568	Can you elaborate on this bit? Not sure I follow why this is so much more logic than previously.
lib/MC/MCParser/MCAsmLexer.cpp
24	Given the bimodal behaviour based on Value, this should probably just be two functions.
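
The last suggestion, splitting the `bool Value` switch into two functions, could look roughly like this. It is a sketch under assumptions, not the revised patch: `IdentifierChars`, `allowIdentifierChars` and `disallowIdentifierChars` are hypothetical names, and `std::bitset<256>` again stands in for `SmallBitVector`.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical replacement for setIdentifierCharSet(bool Value, ...):
// one function per direction instead of a mode flag.
class IdentifierChars {
  std::bitset<256> Prefix;
  std::bitset<256> Body;

  static void apply(std::bitset<256> &Set, const std::string &Chars,
                    bool Value) {
    for (char C : Chars)
      Set.set(static_cast<unsigned char>(C), Value);
  }

public:
  void allowIdentifierChars(const std::string &PfxChars,
                            const std::string &BodyChars) {
    apply(Prefix, PfxChars, true);
    apply(Body, BodyChars, true);
  }
  void disallowIdentifierChars(const std::string &PfxChars,
                               const std::string &BodyChars) {
    apply(Prefix, PfxChars, false);
    apply(Body, BodyChars, false);
  }

  bool isIdentifierPrefixChar(char C) const {
    return Prefix.test(static_cast<unsigned char>(C));
  }
  bool isIdentifierBodyChar(char C) const {
    return Body.test(static_cast<unsigned char>(C));
  }
};

// Example use: toggling '@', as setAllowAtInIdentifier(bool) does in the patch.
inline void exampleUsage() {
  IdentifierChars Chars;
  Chars.allowIdentifierChars("", "@");
  assert(Chars.isIdentifierBodyChar('@'));
  Chars.disallowIdentifierChars("", "@");
  assert(!Chars.isIdentifierBodyChar('@'));
}
```

Call sites then state their intent in the function name, and the separate prefix and body queries also cover the earlier question about whether a combined "contains" query is needed.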

This revision now requires changes to proceed. Dec 9 2016, 2:58 PM
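
For readers following the question about line 568 of AsmLexer.cpp: the logic there decides whether a token that begins with '.' followed by a digit is a floating-point literal or an identifier such as `.1243foo`. A standalone sketch of that decision is given below. It is an illustration only: `TokenKind`, `isBodyChar` and `classifyDotDigit` are hypothetical names, the body set is fixed to the assumed default `[a-zA-Z0-9_$.?]`, and the configurable sets and '@' handling from the patch are not modelled.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

enum class TokenKind { Real, Identifier };

inline bool isDigit(char C) { return C >= '0' && C <= '9'; }

// Assumed default identifier body set: [a-zA-Z0-9_$.?] ('@' omitted here).
inline bool isBodyChar(char C) {
  return (C >= 'a' && C <= 'z') || (C >= 'A' && C <= 'Z') || isDigit(C) ||
         C == '_' || C == '$' || C == '.' || C == '?';
}

// Classifies a token that starts with '.' followed by at least one digit.
inline TokenKind classifyDotDigit(const std::string &Text) {
  assert(Text.size() >= 2 && Text[0] == '.' && isDigit(Text[1]));
  std::size_t I = 1;
  while (I < Text.size() && isDigit(Text[I]))
    ++I;
  // End of input, an exponent marker, or a character that cannot continue an
  // identifier means the digits belong to a floating-point literal.
  if (I == Text.size() || Text[I] == 'e' || Text[I] == 'E' ||
      !isBodyChar(Text[I]))
    return TokenKind::Real;
  return TokenKind::Identifier;
}
```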

arsenm resigned from this revision. Apr 5 2020, 8:26 AM

Herald added subscribers: kerbowa, tpr, jvesely, arsenm. Apr 5 2020, 8:26 AM

Revision Contents

Path	Size
include/llvm/MC/MCParser/AsmLexer.h	12 lines
include/llvm/MC/MCParser/MCAsmLexer.h	13 lines
lib/MC/MCParser/AsmLexer.cpp	125 lines
lib/MC/MCParser/MCAsmLexer.cpp	5 lines

Diff 48072

include/llvm/MC/MCParser/AsmLexer.h

//===- AsmLexer.h - Lexer for Assembly Files --------------------*- C++ -*-===//		//===- AsmLexer.h - Lexer for Assembly Files --------------------*- C++ -*-===//
//		//
// The LLVM Compiler Infrastructure		// The LLVM Compiler Infrastructure
//		//
// This file is distributed under the University of Illinois Open Source		// This file is distributed under the University of Illinois Open Source
// License. See LICENSE.TXT for details.		// License. See LICENSE.TXT for details.
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
//		//
// This class declares the lexer for assembly files.		// This class declares the lexer for assembly files.
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

#ifndef LLVM_MC_MCPARSER_ASMLEXER_H		#ifndef LLVM_MC_MCPARSER_ASMLEXER_H
#define LLVM_MC_MCPARSER_ASMLEXER_H		#define LLVM_MC_MCPARSER_ASMLEXER_H

		#include "llvm/ADT/SmallBitVector.h"
#include "llvm/ADT/StringRef.h"		#include "llvm/ADT/StringRef.h"
#include "llvm/MC/MCParser/MCAsmLexer.h"		#include "llvm/MC/MCParser/MCAsmLexer.h"
#include "llvm/Support/DataTypes.h"		#include "llvm/Support/DataTypes.h"
#include <string>		#include <string>

namespace llvm {		namespace llvm {
class MemoryBuffer;		class MemoryBuffer;
class MCAsmInfo;		class MCAsmInfo;

/// AsmLexer - Lexer class for assembly files.		/// AsmLexer - Lexer class for assembly files.
class AsmLexer : public MCAsmLexer {		class AsmLexer : public MCAsmLexer {
const MCAsmInfo &MAI;		const MCAsmInfo &MAI;

const char *CurPtr;		const char *CurPtr;
StringRef CurBuf;		StringRef CurBuf;
bool isAtStartOfLine;		bool isAtStartOfLine;

		SmallBitVector IdPrefixCharSet;
		SmallBitVector IdBodyCharSet;

void operator=(const AsmLexer&) = delete;		void operator=(const AsmLexer&) = delete;
AsmLexer(const AsmLexer&) = delete;		AsmLexer(const AsmLexer&) = delete;

protected:		protected:
/// LexToken - Read the next token and return its code.		/// LexToken - Read the next token and return its code.
AsmToken LexToken() override;		AsmToken LexToken() override;

public:		public:
[12 lines not shown]	public:
bool isAtStatementSeparator(const char *Ptr);		bool isAtStatementSeparator(const char *Ptr);

const MCAsmInfo &getMAI() const { return MAI; }		const MCAsmInfo &getMAI() const { return MAI; }

private:		private:
int getNextChar();		int getNextChar();
AsmToken ReturnError(const char *Loc, const std::string &Msg);		AsmToken ReturnError(const char *Loc, const std::string &Msg);

		bool IsIdentifierPrefixChar(char c) const;
		bool IsIdentifierBodyChar(char c) const;

		void setIdentifierCharSet(bool Value,
		StringRef PfxCharSet,
		StringRef BodyCharSet) override;
		bool isIdentifierCharSetContains(char) const override;

AsmToken LexIdentifier();		AsmToken LexIdentifier();
AsmToken LexSlash();		AsmToken LexSlash();
AsmToken LexLineComment();		AsmToken LexLineComment();
AsmToken LexDigit();		AsmToken LexDigit();
AsmToken LexSingleQuote();		AsmToken LexSingleQuote();
AsmToken LexQuote();		AsmToken LexQuote();
AsmToken LexFloatLiteral();		AsmToken LexFloatLiteral();
AsmToken LexHexFloatLiteral(bool NoIntDigits);		AsmToken LexHexFloatLiteral(bool NoIntDigits);
};		};

} // end namespace llvm		} // end namespace llvm

#endif		#endif

include/llvm/MC/MCParser/MCAsmLexer.h

[123 lines not shown]	class MCAsmLexer {
SMLoc ErrLoc;		SMLoc ErrLoc;
std::string Err;		std::string Err;

MCAsmLexer(const MCAsmLexer &) = delete;		MCAsmLexer(const MCAsmLexer &) = delete;
void operator=(const MCAsmLexer &) = delete;		void operator=(const MCAsmLexer &) = delete;
protected: // Can only create subclasses.		protected: // Can only create subclasses.
const char *TokStart;		const char *TokStart;
bool SkipSpace;		bool SkipSpace;
bool AllowAtInIdentifier;

MCAsmLexer();		MCAsmLexer();

virtual AsmToken LexToken() = 0;		virtual AsmToken LexToken() = 0;

void SetError(SMLoc errLoc, const std::string &err) {		void SetError(SMLoc errLoc, const std::string &err) {
ErrLoc = errLoc;		ErrLoc = errLoc;
Err = err;		Err = err;
[62 lines not shown]	public:
bool is(AsmToken::TokenKind K) const { return getTok().is(K); }		bool is(AsmToken::TokenKind K) const { return getTok().is(K); }

/// Check if the current token has kind \p K.		/// Check if the current token has kind \p K.
bool isNot(AsmToken::TokenKind K) const { return getTok().isNot(K); }		bool isNot(AsmToken::TokenKind K) const { return getTok().isNot(K); }

/// Set whether spaces should be ignored by the lexer		/// Set whether spaces should be ignored by the lexer
void setSkipSpace(bool val) { SkipSpace = val; }		void setSkipSpace(bool val) { SkipSpace = val; }

bool getAllowAtInIdentifier() { return AllowAtInIdentifier; }		/// allow/disallow an identifier to contain specified characters
void setAllowAtInIdentifier(bool v) { AllowAtInIdentifier = v; }		virtual void setIdentifierCharSet(bool Value,
		rafael: Why do you need to make these virtual?
		vpykhtin: Well, not making it virtual would require the bitvector sets to be part of this class. I'm not objecting, though, as that is already done with SkipSpace and AllowAtInIdentifier.
		grosbach: With the generalization, these can go away entirely, yes? Replace the callsites w/ the new API.
		StringRef PfxCharSet,
		StringRef BodyCharSet);

		/// test whether the specified character can be found in an identifier
		virtual bool isIdentifierCharSetContains(char) const = 0;
		rafael: is..Contains is a strange name since it has two verbs.
		vpykhtin: What would be a better name here?

		bool getAllowAtInIdentifier() { return isIdentifierCharSetContains('@'); }
		void setAllowAtInIdentifier(bool v) { setIdentifierCharSet(v, "", "@"); }
		grosbach: This should start with "is", not "Is", according to the coding guidelines.
};		};

} // End llvm namespace		} // End llvm namespace

#endif		#endif
		grosbach: Ditto.
		grosbach: This feels really weird. Wouldn't any callsites want to be using one of the other two? They'll know their context. I don't see any invocations of this method in the patch. Why is it needed at all?

lib/MC/MCParser/AsmLexer.cpp

[9 lines not shown]
// This class implements the lexer for assembly files.		// This class implements the lexer for assembly files.
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

#include "llvm/MC/MCParser/AsmLexer.h"		#include "llvm/MC/MCParser/AsmLexer.h"
#include "llvm/MC/MCAsmInfo.h"		#include "llvm/MC/MCAsmInfo.h"
#include "llvm/Support/MemoryBuffer.h"		#include "llvm/Support/MemoryBuffer.h"
#include "llvm/Support/SMLoc.h"		#include "llvm/Support/SMLoc.h"
#include <cctype>
#include <cerrno>		#include <cerrno>
#include <cstdio>		#include <cstdio>
#include <cstdlib>		#include <cstdlib>
using namespace llvm;		using namespace llvm;

		// standard is(x)digit generally much slower than simple
		rafael: Why?
		vpykhtin: Well, it is based on my previous experience on Windows, where we had a lexer in which these routines ate up to 10% of scan time. Probably not as "generally" as I stated, though. I'm not insisting on this particular change and can remove it.
		// checks like below
		inline static bool isDigit(char C) {
		return (C >= '0' && C <= '9');
		}

		inline static bool isHexDigit(char C) {
		return isDigit(C)
		|| (C >= 'A' && C <= 'F')
		|| (C >= 'a' && C <= 'f');
		}

AsmLexer::AsmLexer(const MCAsmInfo &MAI) : MAI(MAI) {		AsmLexer::AsmLexer(const MCAsmInfo &MAI) : MAI(MAI) {
CurPtr = nullptr;		CurPtr = nullptr;
isAtStartOfLine = true;		isAtStartOfLine = true;
AllowAtInIdentifier = !StringRef(MAI.getCommentString()).startswith("@");
		IdPrefixCharSet.resize(256);
		IdPrefixCharSet.set('a', 'z' + 1);
		IdPrefixCharSet.set('A', 'Z' + 1);
		IdPrefixCharSet.set('.');
		IdPrefixCharSet.set('_');

		IdBodyCharSet = IdPrefixCharSet;
		IdBodyCharSet.set('0', '9' + 1);
		IdBodyCharSet.set('$');
		IdBodyCharSet.set('?');

		if (!StringRef(MAI.getCommentString()).startswith("@"))
		IdBodyCharSet.set('@');
}		}

AsmLexer::~AsmLexer() {		AsmLexer::~AsmLexer() {
}		}

		void AsmLexer::setIdentifierCharSet(bool Value,
		StringRef PfxCharSet,
		StringRef BodyCharSet) {
		if (Value) {
		for (auto C : PfxCharSet)
		IdPrefixCharSet.set((unsigned char)C);
		for (auto C : BodyCharSet)
		IdBodyCharSet.set((unsigned char)C);
		} else {
		for (auto C : PfxCharSet)
		IdPrefixCharSet.reset((unsigned char)C);
		for (auto C : BodyCharSet)
		IdBodyCharSet.reset((unsigned char)C);
		}
		}

		inline bool AsmLexer::IsIdentifierPrefixChar(char C) const {
		return IdPrefixCharSet.test((unsigned char)C);
		}

		inline bool AsmLexer::IsIdentifierBodyChar(char C) const {
		return IdBodyCharSet.test((unsigned char)C);
		}

		bool AsmLexer::isIdentifierCharSetContains(char C) const {
		return IsIdentifierBodyChar(C) || IsIdentifierPrefixChar(C);
		}

void AsmLexer::setBuffer(StringRef Buf, const char *ptr) {		void AsmLexer::setBuffer(StringRef Buf, const char *ptr) {
CurBuf = Buf;		CurBuf = Buf;

if (ptr)		if (ptr)
CurPtr = ptr;		CurPtr = ptr;
else		else
CurPtr = CurBuf.begin();		CurPtr = CurBuf.begin();

[27 lines not shown]

/// LexFloatLiteral: [0-9]*[.][0-9]*([eE][+-]?[0-9]*)?		/// LexFloatLiteral: [0-9]*[.][0-9]*([eE][+-]?[0-9]*)?
///		///
/// The leading integral digit sequence and dot should have already been		/// The leading integral digit sequence and dot should have already been
/// consumed, some or all of the fractional digit sequence can have been		/// consumed, some or all of the fractional digit sequence can have been
/// consumed.		/// consumed.
AsmToken AsmLexer::LexFloatLiteral() {		AsmToken AsmLexer::LexFloatLiteral() {
// Skip the fractional digit sequence.		// Skip the fractional digit sequence.
while (isdigit(*CurPtr))		while (isDigit(*CurPtr))
++CurPtr;		++CurPtr;

// Check for exponent; we intentionally accept a slighlty wider set of		// Check for exponent; we intentionally accept a slighlty wider set of
// literals here and rely on the upstream client to reject invalid ones (e.g.,		// literals here and rely on the upstream client to reject invalid ones (e.g.,
// "1e+").		// "1e+").
if (*CurPtr == 'e' || *CurPtr == 'E') {		if (*CurPtr == 'e' || *CurPtr == 'E') {
++CurPtr;		++CurPtr;
if (*CurPtr == '-' || *CurPtr == '+')		if (*CurPtr == '-' || *CurPtr == '+')
++CurPtr;		++CurPtr;
while (isdigit(*CurPtr))		while (isDigit(*CurPtr))
++CurPtr;		++CurPtr;
}		}

return AsmToken(AsmToken::Real,		return AsmToken(AsmToken::Real,
StringRef(TokStart, CurPtr - TokStart));		StringRef(TokStart, CurPtr - TokStart));
}		}

/// LexHexFloatLiteral matches essentially (.[0-9a-fA-F]*)?[pP][+-]?[0-9a-fA-F]+		/// LexHexFloatLiteral matches essentially (.[0-9a-fA-F]*)?[pP][+-]?[0-9a-fA-F]+
/// while making sure there are enough actual digits around for the constant to		/// while making sure there are enough actual digits around for the constant to
/// be valid.		/// be valid.
///		///
/// The leading "0x[0-9a-fA-F]*" (i.e. integer part) has already been consumed		/// The leading "0x[0-9a-fA-F]*" (i.e. integer part) has already been consumed
/// before we get here.		/// before we get here.
AsmToken AsmLexer::LexHexFloatLiteral(bool NoIntDigits) {		AsmToken AsmLexer::LexHexFloatLiteral(bool NoIntDigits) {
assert((*CurPtr == 'p' || *CurPtr == 'P' || *CurPtr == '.') &&		assert((*CurPtr == 'p' || *CurPtr == 'P' || *CurPtr == '.') &&
"unexpected parse state in floating hex");		"unexpected parse state in floating hex");
bool NoFracDigits = true;		bool NoFracDigits = true;

// Skip the fractional part if there is one		// Skip the fractional part if there is one
if (*CurPtr == '.') {		if (*CurPtr == '.') {
++CurPtr;		++CurPtr;

const char *FracStart = CurPtr;		const char *FracStart = CurPtr;
while (isxdigit(*CurPtr))		while (isHexDigit(*CurPtr))
++CurPtr;		++CurPtr;

NoFracDigits = CurPtr == FracStart;		NoFracDigits = CurPtr == FracStart;
}		}

if (NoIntDigits && NoFracDigits)		if (NoIntDigits && NoFracDigits)
return ReturnError(TokStart, "invalid hexadecimal floating-point constant: "		return ReturnError(TokStart, "invalid hexadecimal floating-point constant: "
"expected at least one significand digit");		"expected at least one significand digit");

// Make sure we do have some kind of proper exponent part		// Make sure we do have some kind of proper exponent part
if (*CurPtr != 'p' && *CurPtr != 'P')		if (*CurPtr != 'p' && *CurPtr != 'P')
return ReturnError(TokStart, "invalid hexadecimal floating-point constant: "		return ReturnError(TokStart, "invalid hexadecimal floating-point constant: "
"expected exponent part 'p'");		"expected exponent part 'p'");
++CurPtr;		++CurPtr;

if (*CurPtr == '+' || *CurPtr == '-')		if (*CurPtr == '+' || *CurPtr == '-')
++CurPtr;		++CurPtr;

// N.b. exponent digits are not hex		// N.b. exponent digits are not hex
const char *ExpStart = CurPtr;		const char *ExpStart = CurPtr;
while (isdigit(*CurPtr))		while (isDigit(*CurPtr))
++CurPtr;		++CurPtr;

if (CurPtr == ExpStart)		if (CurPtr == ExpStart)
return ReturnError(TokStart, "invalid hexadecimal floating-point constant: "		return ReturnError(TokStart, "invalid hexadecimal floating-point constant: "
"expected at least one exponent digit");		"expected at least one exponent digit");

return AsmToken(AsmToken::Real, StringRef(TokStart, CurPtr - TokStart));		return AsmToken(AsmToken::Real, StringRef(TokStart, CurPtr - TokStart));
}		}

/// LexIdentifier: [a-zA-Z_.][a-zA-Z0-9_$.@?]*
static bool IsIdentifierChar(char c, bool AllowAt) {
return isalnum(c) || c == '_' || c == '$' || c == '.' ||
(c == '@' && AllowAt) || c == '?';
}
AsmToken AsmLexer::LexIdentifier() {		AsmToken AsmLexer::LexIdentifier() {
// Check for floating point literals.		assert(IsIdentifierPrefixChar(*TokStart));
if (CurPtr[-1] == '.' && isdigit(*CurPtr)) {		while (IsIdentifierBodyChar(*CurPtr))
// Disambiguate a .1243foo identifier from a floating literal.
while (isdigit(*CurPtr))
++CurPtr;
if (*CurPtr == 'e' || *CurPtr == 'E' ||
!IsIdentifierChar(*CurPtr, AllowAtInIdentifier))
return LexFloatLiteral();
}

while (IsIdentifierChar(*CurPtr, AllowAtInIdentifier))
++CurPtr;		++CurPtr;

// Handle . as a special case.
if (CurPtr == TokStart+1 && TokStart[0] == '.')
return AsmToken(AsmToken::Dot, StringRef(TokStart, 1));

return AsmToken(AsmToken::Identifier, StringRef(TokStart, CurPtr - TokStart));		return AsmToken(AsmToken::Identifier, StringRef(TokStart, CurPtr - TokStart));
}		}

/// LexSlash: Slash: /		/// LexSlash: Slash: /
/// C-Style Comment: /* ... */		/// C-Style Comment: /* ... */
AsmToken AsmLexer::LexSlash() {		AsmToken AsmLexer::LexSlash() {
switch (*CurPtr) {		switch (*CurPtr) {
case '*': break; // C style comment.		case '*': break; // C style comment.
[43 lines not shown]
}		}

// Look ahead to search for first non-hex digit, if it's [hH], then we treat the		// Look ahead to search for first non-hex digit, if it's [hH], then we treat the
// integer as a hexadecimal, possibly with leading zeroes.		// integer as a hexadecimal, possibly with leading zeroes.
static unsigned doLookAhead(const char *&CurPtr, unsigned DefaultRadix) {		static unsigned doLookAhead(const char *&CurPtr, unsigned DefaultRadix) {
const char *FirstHex = nullptr;		const char *FirstHex = nullptr;
const char *LookAhead = CurPtr;		const char *LookAhead = CurPtr;
while (1) {		while (1) {
if (isdigit(*LookAhead)) {		if (isDigit(*LookAhead)) {
++LookAhead;		++LookAhead;
} else if (isxdigit(*LookAhead)) {		} else if (isHexDigit(*LookAhead)) {
if (!FirstHex)		if (!FirstHex)
FirstHex = LookAhead;		FirstHex = LookAhead;
++LookAhead;		++LookAhead;
} else {		} else {
break;		break;
}		}
}		}
bool isHex = *LookAhead == 'h' || *LookAhead == 'H';		bool isHex = *LookAhead == 'h' || *LookAhead == 'H';
[44 lines not shown]	if (CurPtr[-1] != '0' || CurPtr[0] == '.') {
SkipIgnoredIntegerSuffix(CurPtr);		SkipIgnoredIntegerSuffix(CurPtr);

return intToken(Result, Value);		return intToken(Result, Value);
}		}

if (*CurPtr == 'b') {		if (*CurPtr == 'b') {
++CurPtr;		++CurPtr;
// See if we actually have "0b" as part of something like "jmp 0b\n"		// See if we actually have "0b" as part of something like "jmp 0b\n"
if (!isdigit(CurPtr[0])) {		if (!isDigit(CurPtr[0])) {
--CurPtr;		--CurPtr;
StringRef Result(TokStart, CurPtr - TokStart);		StringRef Result(TokStart, CurPtr - TokStart);
return AsmToken(AsmToken::Integer, Result, 0);		return AsmToken(AsmToken::Integer, Result, 0);
}		}
const char *NumStart = CurPtr;		const char *NumStart = CurPtr;
while (CurPtr[0] == '0' || CurPtr[0] == '1')		while (CurPtr[0] == '0' || CurPtr[0] == '1')
++CurPtr;		++CurPtr;

[12 lines not shown]	if (*CurPtr == 'b') {
SkipIgnoredIntegerSuffix(CurPtr);		SkipIgnoredIntegerSuffix(CurPtr);

return intToken(Result, Value);		return intToken(Result, Value);
}		}

if (*CurPtr == 'x') {		if (*CurPtr == 'x') {
++CurPtr;		++CurPtr;
const char *NumStart = CurPtr;		const char *NumStart = CurPtr;
while (isxdigit(CurPtr[0]))		while (isHexDigit(CurPtr[0]))
++CurPtr;		++CurPtr;

// "0x.0p0" is valid, and "0x0p0" (but not "0xp0" for example, which will be		// "0x.0p0" is valid, and "0x0p0" (but not "0xp0" for example, which will be
// diagnosed by LexHexFloatLiteral).		// diagnosed by LexHexFloatLiteral).
if (CurPtr[0] == '.' || CurPtr[0] == 'p' || CurPtr[0] == 'P')		if (CurPtr[0] == '.' || CurPtr[0] == 'p' || CurPtr[0] == 'P')
return LexHexFloatLiteral(NumStart == CurPtr);		return LexHexFloatLiteral(NumStart == CurPtr);

// Otherwise requires at least one hex digit.		// Otherwise requires at least one hex digit.
[160 lines not shown]
bool AsmLexer::isAtStatementSeparator(const char *Ptr) {		bool AsmLexer::isAtStatementSeparator(const char *Ptr) {
return strncmp(Ptr, MAI.getSeparatorString(),		return strncmp(Ptr, MAI.getSeparatorString(),
strlen(MAI.getSeparatorString())) == 0;		strlen(MAI.getSeparatorString())) == 0;
}		}

AsmToken AsmLexer::LexToken() {		AsmToken AsmLexer::LexToken() {
TokStart = CurPtr;		TokStart = CurPtr;
// This always consumes at least one character.		// This always consumes at least one character.
int CurChar = getNextChar();		const int CurChar = getNextChar();

if (isAtStartOfComment(TokStart)) {		if (isAtStartOfComment(TokStart)) {
// If this comment starts with a '#', then return the Hash token and let		// If this comment starts with a '#', then return the Hash token and let
// the assembler parser see if it can be parsed as a cpp line filename		// the assembler parser see if it can be parsed as a cpp line filename
// comment. We do this only if we are at the start of a line.		// comment. We do this only if we are at the start of a line.
if (CurChar == '#' && isAtStartOfLine)		if (CurChar == '#' && isAtStartOfLine)
return AsmToken(AsmToken::Hash, StringRef(TokStart, 1));		return AsmToken(AsmToken::Hash, StringRef(TokStart, 1));
isAtStartOfLine = true;		isAtStartOfLine = true;
return LexLineComment();		return LexLineComment();
}		}
if (isAtStatementSeparator(TokStart)) {		if (isAtStatementSeparator(TokStart)) {
CurPtr += strlen(MAI.getSeparatorString()) - 1;		CurPtr += strlen(MAI.getSeparatorString()) - 1;
return AsmToken(AsmToken::EndOfStatement,		return AsmToken(AsmToken::EndOfStatement,
StringRef(TokStart, strlen(MAI.getSeparatorString())));		StringRef(TokStart, strlen(MAI.getSeparatorString())));
}		}

// If we're missing a newline at EOF, make sure we still get an		// If we're missing a newline at EOF, make sure we still get an
// EndOfStatement token before the Eof token.		// EndOfStatement token before the Eof token.
if (CurChar == EOF && !isAtStartOfLine) {		if (CurChar == EOF && !isAtStartOfLine) {
isAtStartOfLine = true;		isAtStartOfLine = true;
return AsmToken(AsmToken::EndOfStatement, StringRef(TokStart, 1));		return AsmToken(AsmToken::EndOfStatement, StringRef(TokStart, 1));
}		}

isAtStartOfLine = false;		isAtStartOfLine = false;

		if (CurChar == '.' && isDigit(*CurPtr)) {
		if (!IsIdentifierPrefixChar('.'))
		return LexFloatLiteral();

		const auto SavePos = CurPtr;
		// Disambiguate a .1243foo identifier from a floating literal.
		do { ++CurPtr; }
		while (isDigit(*CurPtr));
		if (*CurPtr == 'e' || *CurPtr == 'E' || !IsIdentifierBodyChar(*CurPtr))
		return LexFloatLiteral();
		CurPtr = SavePos;
		}

		const AsmToken Id = IsIdentifierPrefixChar(CurChar) ?
		LexIdentifier() :
		AsmToken(AsmToken::Error, StringRef());

		// if single char id - need further check
		grosbach: Can you elaborate on this bit? Not sure I follow why this is so much more logic than previously.
		if (Id.is(AsmToken::Identifier) && Id.getString().size() > 1)
		return Id;

switch (CurChar) {		switch (CurChar) {
default:		default:
// Handle identifier: [a-zA-Z_.][a-zA-Z0-9_$.@]*		if (Id.is(AsmToken::Identifier)) // single char id indeed
if (isalpha(CurChar) || CurChar == '_' || CurChar == '.')		return Id;
return LexIdentifier();

// Unknown character, emit an error.		// Unknown character, emit an error.
return ReturnError(TokStart, "invalid character in input");		return ReturnError(TokStart, "invalid character in input");
case EOF: return AsmToken(AsmToken::Eof, StringRef(TokStart, 0));		case EOF: return AsmToken(AsmToken::Eof, StringRef(TokStart, 0));
case 0:		case 0:
case ' ':		case ' ':
case '\t':		case '\t':
if (SkipSpace) {		if (SkipSpace) {
// Ignore whitespace.		// Ignore whitespace.
return LexToken();		return LexToken();
} else {		} else {
int len = 1;		int len = 1;
while (*CurPtr==' ' || *CurPtr=='\t') {		while (*CurPtr==' ' || *CurPtr=='\t') {
CurPtr++;		CurPtr++;
len++;		len++;
}		}
return AsmToken(AsmToken::Space, StringRef(TokStart, len));		return AsmToken(AsmToken::Space, StringRef(TokStart, len));
}		}
case '\n': // FALL THROUGH.		case '\n': // FALL THROUGH.
case '\r':		case '\r':
isAtStartOfLine = true;		isAtStartOfLine = true;
return AsmToken(AsmToken::EndOfStatement, StringRef(TokStart, 1));		return AsmToken(AsmToken::EndOfStatement, StringRef(TokStart, 1));
		case '.': return AsmToken(AsmToken::Dot, StringRef(TokStart, 1));
case ':': return AsmToken(AsmToken::Colon, StringRef(TokStart, 1));		case ':': return AsmToken(AsmToken::Colon, StringRef(TokStart, 1));
case '+': return AsmToken(AsmToken::Plus, StringRef(TokStart, 1));		case '+': return AsmToken(AsmToken::Plus, StringRef(TokStart, 1));
case '-': return AsmToken(AsmToken::Minus, StringRef(TokStart, 1));		case '-': return AsmToken(AsmToken::Minus, StringRef(TokStart, 1));
case '~': return AsmToken(AsmToken::Tilde, StringRef(TokStart, 1));		case '~': return AsmToken(AsmToken::Tilde, StringRef(TokStart, 1));
case '(': return AsmToken(AsmToken::LParen, StringRef(TokStart, 1));		case '(': return AsmToken(AsmToken::LParen, StringRef(TokStart, 1));
case ')': return AsmToken(AsmToken::RParen, StringRef(TokStart, 1));		case ')': return AsmToken(AsmToken::RParen, StringRef(TokStart, 1));
case '[': return AsmToken(AsmToken::LBrac, StringRef(TokStart, 1));		case '[': return AsmToken(AsmToken::LBrac, StringRef(TokStart, 1));
case ']': return AsmToken(AsmToken::RBrac, StringRef(TokStart, 1));		case ']': return AsmToken(AsmToken::RBrac, StringRef(TokStart, 1));
[57 lines not shown]

lib/MC/MCParser/MCAsmLexer.cpp

	[13 lines not shown]

	MCAsmLexer::MCAsmLexer() : TokStart(nullptr), SkipSpace(true) {			MCAsmLexer::MCAsmLexer() : TokStart(nullptr), SkipSpace(true) {
	CurTok.emplace_back(AsmToken::Error, StringRef());			CurTok.emplace_back(AsmToken::Error, StringRef());
	}			}

	MCAsmLexer::~MCAsmLexer() {			MCAsmLexer::~MCAsmLexer() {
	}			}

				void MCAsmLexer::setIdentifierCharSet(bool Value,
				StringRef PfxCharSet,
				StringRef BodyCharSet) {
				grosbach: Given the bimodal behaviour based on Value, this should probably just be two functions.
				}

	SMLoc MCAsmLexer::getLoc() const {			SMLoc MCAsmLexer::getLoc() const {
	return SMLoc::getFromPointer(TokStart);			return SMLoc::getFromPointer(TokStart);
	}			}

	SMLoc AsmToken::getLoc() const {			SMLoc AsmToken::getLoc() const {
	return SMLoc::getFromPointer(Str.data());			return SMLoc::getFromPointer(Str.data());
	}			}

	SMLoc AsmToken::getEndLoc() const {			SMLoc AsmToken::getEndLoc() const {
	return SMLoc::getFromPointer(Str.data() + Str.size());			return SMLoc::getFromPointer(Str.data() + Str.size());
	}			}

	SMRange AsmToken::getLocRange() const {			SMRange AsmToken::getLocRange() const {
	return SMRange(getLoc(), getEndLoc());			return SMRange(getLoc(), getEndLoc());
	}			}