This is an archive of the discontinued LLVM Phabricator instance.

[AsmParser] Allow tokens to be put back in to the token stream.
Closed, Public

Authored by colinl on Nov 2 2015, 12:05 PM.

Details

Reviewers

sidneym
craig.topper
rafael
mcrosier

Commits

rGa4c85d4c96fd: [AsmParser] Allow tokens to be put back in to the token stream.
rL252432: [AsmParser] Allow tokens to be put back in to the token stream.

Summary

This patch modifies the AsmParser so that tokens can be injected back into the token stream. This is useful for parsing grammars that would otherwise confuse the lexer.
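As a rough illustration of the put-back idea, the sketch below merges the tokens of a register pair written as v31:30.h into one token and pushes it back onto the stream. Everything in it is hypothetical: `Token`, `ToyLexer`, and `mergeRegisterPair` are stand-ins invented for this example, the "lexer" just replays a pre-split token list, and stripping the trailing dot from "30." is an assumption about how the pieces would be rejoined. It is not the Hexagon backend's code.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using Token = std::string; // hypothetical stand-in for AsmToken

// Toy lexer with a put-back buffer; not the LLVM MCAsmLexer class.
class ToyLexer {
  std::vector<Token> Source; // tokens a real lexer would produce
  std::size_t Next = 0;      // index of the next unread source token
  std::vector<Token> Buf;    // Buf.front() is the current token

  Token lexToken() {
    return Next < Source.size() ? Source[Next++] : Token("<eof>");
  }

public:
  explicit ToyLexer(std::vector<Token> Src) : Source(std::move(Src)) {
    Buf.emplace_back("<error>"); // placeholder until the first lex()
  }

  const Token &getTok() const { return Buf.front(); }

  const Token &lex() {
    assert(!Buf.empty());
    Buf.erase(Buf.begin());
    if (Buf.empty())
      Buf.emplace_back(lexToken());
    return Buf.front();
  }

  void unLex(const Token &T) { Buf.insert(Buf.begin(), T); }
};

// Assumes the current token is "v31" and the next two are ":" and "30.".
// Replaces the three with one "v31:30" token; ".h" is lexed after it.
void mergeRegisterPair(ToyLexer &L) {
  Token Hi = L.getTok(); // "v31"
  Token Colon = L.lex(); // ":"
  Token Lo = L.lex();    // "30.", lexed as a floating-point literal
  if (!Lo.empty() && Lo.back() == '.')
    Lo.pop_back();       // assumption: the dot belongs to the ".h" suffix
  L.lex();               // consume "30."; ".h" is now the current token
  L.unLex(Hi + Colon + Lo);
}
```

After `mergeRegisterPair` runs, `getTok()` returns the merged token and the next `lex()` returns ".h", so the rest of the parser never sees the colon or the floating-point literal.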

Examples:

"v31:30.h": The lexer sees the sequence Token "v31", Colon ":", Double "30.", Token ".h". It's easier to change the tokens to Token "v31:30", Token ".h".

"memw(r0 << #2 + #60)": This sequence confuses the expression parser when it parses "2 + #". The plus looks like an infix addition, but it's actually part of the instruction text.

Diff Detail

Repository: rL LLVM

Event Timeline

colinl updated this revision to Diff 38959. (Nov 2 2015, 12:05 PM)

colinl retitled this revision to [AsmParser] Allow tokens to be put back in to the token stream.

colinl updated this object.

colinl added reviewers: mcrosier, sidneym.

colinl set the repository for this revision to rL LLVM.

colinl added a subscriber: llvm-commits.

colinl added reviewers: rafael, craig.topper. (Nov 2 2015, 7:24 PM)

jevinskie added a subscriber: jevinskie. (Nov 4 2015, 11:06 AM)

Although a data structure like a deque or a ring buffer would more directly represent what we're trying to do, a SmallVector was chosen because some backends hold a reference to the current token across a call to Lex(), as in the following sequence:

AsmToken const &Token = Lexer.getTok();
useToken(Token);
Lexer.Lex();
useToken(Token);

Since the number of tokens in the buffer is going to be small, usually one, a SmallVector with insertion at the head should be fine.
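The buffer behavior described here can be sketched with a toy class. This is only an approximation under stated assumptions: `Token` and `ToyLexer` are hypothetical names, `std::vector` stands in for `SmallVector`, and the source tokens are supplied up front instead of being lexed from text. The `lex()` and `unLex()` bodies mirror the shape of the `Lex()` and `UnLex()` changes in this revision.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using Token = std::string; // hypothetical stand-in for AsmToken

// Toy model of the token buffer; not the LLVM MCAsmLexer class.
class ToyLexer {
  std::vector<Token> Source; // tokens a real lexer would produce
  std::size_t Next = 0;      // index of the next unread source token
  std::vector<Token> Buf;    // Buf.front() is the current token

  // Stand-in for the virtual LexToken(); returns "<eof>" forever at the end.
  Token lexToken() {
    return Next < Source.size() ? Source[Next++] : Token("<eof>");
  }

public:
  explicit ToyLexer(std::vector<Token> Src) : Source(std::move(Src)) {
    Buf.emplace_back("<error>"); // same role as the initial Error token
  }

  const Token &getTok() const { return Buf.front(); }

  // Drop the current token; only hit the underlying lexer once every
  // token that was put back has been drained.
  const Token &lex() {
    assert(!Buf.empty());
    Buf.erase(Buf.begin());
    if (Buf.empty())
      Buf.emplace_back(lexToken());
    return Buf.front();
  }

  // Put a token back at the head; it becomes the current token.
  void unLex(const Token &T) { Buf.insert(Buf.begin(), T); }
};
```

With one token in the buffer, which is the usual case, `lex()` erases a single element and appends a single element, so the head insertion only costs anything when a backend has actually put tokens back.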

There should be no functional change.

I noticed some backends have copied and pasted the built-in expression parser for reasons similar to the above; they could possibly make use of this change.

Closed by commit rL252432: [AsmParser] Allow tokens to be put back in to the token stream. (authored by colinl). (Nov 8 2015, 3:50 PM)

This revision was automatically updated to reflect the committed changes.

What are the alternatives? This is a pretty nasty hack.

With the expression parsing example failing on "+ #", it seemed like the only alternative was to copy/paste the built-in expression parser and modify it to deal with this situation.

Another place where this helped is HexagonAsmParser::ParseInstruction. Previously we only got the string for the first token in a statement, because the first token is usually a mnemonic string; that isn't true here. One option would be to modify AsmParser so that it doesn't consume the token, but a lot of the logic around labels, directives, etc. assumes the first token has been consumed.

A lot of this is because different ASM grammars lex differently. I'm not sure having one lexer/parser for everything is sustainable. I'm open to ideas.

Would it be reasonable to instead add target hooks for things like this? Either specialized things like the comment character handling or something more general like the ability to override recognizing an integer literal or whatever.

I think that would be a better situation. In related commits I did add the ability for targets to specify additional tokens: http://reviews.llvm.org/D14256

It seems like if things get much more complicated, we're going to need a real, target-specified parser and lexer produced by parser generator tools.

I'll keep looking at it, though. Maybe I can move things around and rewrite them in terms of peekTokens. Then again, peekTokens could be rewritten in terms of this.
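To make the last remark concrete, here is one way a peek could be built from lex plus put-back. It is a sketch only: `Token`, `ToyLexer`, and `peekViaUnLex` are hypothetical names for this example, the toy lexer replays a pre-split token list, and nothing here reflects how the real peekTokens is implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using Token = std::string; // hypothetical stand-in for AsmToken

// Toy lexer with a put-back buffer; not the LLVM MCAsmLexer class.
class ToyLexer {
  std::vector<Token> Source;
  std::size_t Next = 0;
  std::vector<Token> Buf; // Buf.front() is the current token

  Token lexToken() {
    return Next < Source.size() ? Source[Next++] : Token("<eof>");
  }

public:
  explicit ToyLexer(std::vector<Token> Src) : Source(std::move(Src)) {
    Buf.emplace_back("<error>");
  }
  const Token &getTok() const { return Buf.front(); }
  const Token &lex() {
    assert(!Buf.empty());
    Buf.erase(Buf.begin());
    if (Buf.empty())
      Buf.emplace_back(lexToken());
    return Buf.front();
  }
  void unLex(const Token &T) { Buf.insert(Buf.begin(), T); }
};

// Hypothetical peek: return the next N tokens while leaving the current
// token and the order of all upcoming tokens unchanged.
std::vector<Token> peekViaUnLex(ToyLexer &L, std::size_t N) {
  Token Cur = L.getTok();
  std::vector<Token> Ahead;
  for (std::size_t I = 0; I < N; ++I)
    Ahead.push_back(L.lex());
  // The last lexed token is already at the head of the buffer, so put back
  // the earlier ones in reverse order, then the original current token.
  for (std::size_t I = Ahead.size(); I-- > 1;)
    L.unLex(Ahead[I - 1]);
  if (N > 0)
    L.unLex(Cur);
  return Ahead;
}
```

The reverse order matters because each `unLex()` inserts at the head: the token put back last is the one returned first.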

Revision Contents

Path                                                 Size
llvm/trunk/include/llvm/MC/MCParser/MCAsmLexer.h     14 lines
llvm/trunk/lib/MC/MCParser/MCAsmLexer.cpp            4 lines

Diff 39656

llvm/trunk/include/llvm/MC/MCParser/MCAsmLexer.h

(112 lines not shown)
   APInt getAPIntVal() const {
     return IntVal;
   }
 };

 /// Generic assembler lexer interface, for use by target specific assembly
 /// lexers.
 class MCAsmLexer {
   /// The current token, stored in the base class for faster access.
-  AsmToken CurTok;
+  SmallVector<AsmToken, 1> CurTok;

   /// The location and description of the current error
   SMLoc ErrLoc;
   std::string Err;

   MCAsmLexer(const MCAsmLexer &) = delete;
   void operator=(const MCAsmLexer &) = delete;
 protected: // Can only create subclasses.
(13 lines not shown)
 public:
   virtual ~MCAsmLexer();

   /// Consume the next token from the input stream and return it.
   ///
   /// The lexer will continuosly return the end-of-file token once the end of
   /// the main input file has been reached.
   const AsmToken &Lex() {
-    return CurTok = LexToken();
+    assert(!CurTok.empty());
+    CurTok.erase(CurTok.begin());
+    if (CurTok.empty())
+      CurTok.emplace_back(LexToken());
+    return CurTok.front();
+  }
+
+  void UnLex(AsmToken const &Token) {
+    CurTok.insert(CurTok.begin(), Token);
   }

   virtual StringRef LexUntilEndOfStatement() = 0;

   /// Get the current source location.
   SMLoc getLoc() const;

   /// Get the current (last) lexed token.
   const AsmToken &getTok() const {
-    return CurTok;
+    return CurTok[0];
   }

   /// Look ahead at the next token to be lexed.
   const AsmToken peekTok(bool ShouldSkipSpace = true) {
     AsmToken Tok;

     MutableArrayRef<AsmToken> Buf(Tok);
     size_t ReadCount = peekTokens(Buf, ShouldSkipSpace);
(40 lines not shown)

llvm/trunk/lib/MC/MCParser/MCAsmLexer.cpp

 //===-- MCAsmLexer.cpp - Abstract Asm Lexer Interface ---------------------===//
 //
 // The LLVM Compiler Infrastructure
 //
 // This file is distributed under the University of Illinois Open Source
 // License. See LICENSE.TXT for details.
 //
 //===----------------------------------------------------------------------===//

 #include "llvm/MC/MCParser/MCAsmLexer.h"
 #include "llvm/Support/SourceMgr.h"

 using namespace llvm;

-MCAsmLexer::MCAsmLexer() : CurTok(AsmToken::Error, StringRef()),
-                           TokStart(nullptr), SkipSpace(true) {
+MCAsmLexer::MCAsmLexer() : TokStart(nullptr), SkipSpace(true) {
+  CurTok.emplace_back(AsmToken::Error, StringRef());
 }

 MCAsmLexer::~MCAsmLexer() {
 }

 SMLoc MCAsmLexer::getLoc() const {
   return SMLoc::getFromPointer(TokStart);
 }
(12 lines not shown)