This is an archive of the discontinued LLVM Phabricator instance.

Differential D10521

MIR Parsing: Introduce a MI Lexing class.
Closed, Public

Authored by arphaman on Jun 17 2015, 4:27 PM.

Details

Reviewers

bob.wilson
bogner
dexonsmith

Commits

rG91370c5d6272: MIR Serialization: Introduce a lexer for machine instructions.
rL240323: MIR Serialization: Introduce a lexer for machine instructions.

Summary

This patch is based on a patch that serializes machine instruction names (http://reviews.llvm.org/D10481).

This patch adds a MILexer class that complements the MIParser class from the previous patch. It still only allows machine instruction names to be serialized, but this time it tokenizes the source string so that the MIParser can progress to parsing machine operands.

Diff Detail

Repository: rL LLVM

Event Timeline

arphaman updated this revision to Diff 27891. Jun 17 2015, 4:27 PM

arphaman retitled this revision to MIR Parsing: Introduce a MI Lexing class.

arphaman updated this object.

arphaman edited the test plan for this revision.

arphaman added reviewers: dexonsmith, bob.wilson, bogner.

arphaman set the repository for this revision to rL LLVM.

arphaman added a subscriber: Unknown Object (MLST).

silvas added a subscriber: silvas. Jun 17 2015, 10:09 PM

silvas added inline comments.

lib/CodeGen/MIRParser/MILexer.cpp
32–36 ↗	(On Diff #27891)	Since we ultimately want to associate this back to the yaml file, using SourceMgr here doesn't seem like a very good fit. It sounds a lot more convenient to have the lexer just be StringRef based, then we can just add some simple newline and character counting to issue a custom diagnostic back in the YAML (or reuse SourceMgr just for that part, outside the core lexing).

In the past I have found this core lexing interface to be useful (supplemented by a sugar "Lexer" class in the header):

    struct Token {
      ... Kind (which can be an error), etc. ...
      StringRef Range;
    };
    StringRef lex(StringRef Range, Token &OutTok); // (or lexImpl or whatever).

The return value is a new range, a suffix of the old range, containing the remaining yet-to-be-lexed characters. Some useful invariants to maintain are:

    Token Tok, OtherTok;
    StringRef R = ...., NewR;
    NewR = lex(R, Tok);
    assert(lex(Tok.Range, OtherTok).size() == 0); // Entire range of a valid token is consumed.
    assert(Tok == OtherTok); // The exact token is recovered by re-lexing.

A convenient Lexer class can then be trivially built around this in the header (alongside token kind definitions and such). Everything else is in the .cpp file and nicely decoupled.

When I've used this in the past, the first thing to do in lexImpl is to stuff the incoming stringref into a trivial "Cursor" class that has a .peek() method which checks for EOF and returns '\0', otherwise the char at the cursor. This eliminates a lot of repeated "!isEOF() && is...." checks (your patch already has two of them: "while (!isEOF() && isspace(CurPtr))" and "while (!isEOF() && isIdentifierChar(CurPtr))"); these can then be written e.g. `while (isspace(C.peek()))`.

The main body of lexImpl then becomes something like:

    StringRef lexImpl(StringRef R, Token &OutTok) {
      Cursor C(R);
      skipWhitespace(C);
      if (C.isEOF())
        return StringRef();
      if (Cursor RC = maybeLexDelimiter(C, OutTok))
        return RC.remaining();
      if (Cursor RC = maybeLexIdentifier(C, OutTok))
        return RC.remaining();
      if (Cursor RC = maybeLexNumber(C, OutTok))
        return RC.remaining();
      ....
    }
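The interface and invariants described in this comment can be sketched in standard C++17 as follows. This is only an illustration under stated assumptions: `std::string_view` stands in for `llvm::StringRef`, and `Token`, `Cursor`, and `lexToken` are hypothetical names written for this sketch, not the code that was eventually committed. An error token here covers exactly one character so that the re-lexing invariant also holds for it.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string_view>

// Hypothetical token type; std::string_view stands in for llvm::StringRef.
struct Token {
  enum Kind { Eof, Error, Identifier, Number };
  Kind K;
  std::string_view Range;
  // Two tokens are equal when they have the same kind and cover the same bytes.
  bool operator==(const Token &O) const {
    return K == O.K && Range.data() == O.Range.data() &&
           Range.size() == O.Range.size();
  }
};

// Trivial cursor: peek() yields '\0' at EOF, so loops need no isEOF() guard.
class Cursor {
  const char *Ptr;
  const char *End;

public:
  explicit Cursor(std::string_view S)
      : Ptr(S.data()), End(S.data() + S.size()) {}
  bool isEOF() const { return Ptr == End; }
  char peek() const { return isEOF() ? '\0' : *Ptr; }
  void advance() { ++Ptr; }
  std::string_view remaining() const {
    return std::string_view(Ptr, static_cast<std::size_t>(End - Ptr));
  }
  // The characters between this cursor and a later cursor C.
  std::string_view upto(Cursor C) const {
    return std::string_view(Ptr, static_cast<std::size_t>(C.Ptr - Ptr));
  }
};

static bool isIdentStart(char C) {
  return std::isalpha(static_cast<unsigned char>(C)) || C == '_';
}

static bool isIdentChar(char C) {
  return std::isalnum(static_cast<unsigned char>(C)) || C == '_' ||
         C == '-' || C == '.';
}

// Lex one token from Source and return the not-yet-lexed suffix.
std::string_view lexToken(std::string_view Source, Token &Out) {
  Cursor C(Source);
  while (std::isspace(static_cast<unsigned char>(C.peek())))
    C.advance();
  Cursor Start = C;
  if (C.isEOF()) {
    Out = Token{Token::Eof, C.remaining()};
    return C.remaining();
  }
  if (isIdentStart(C.peek())) {
    while (isIdentChar(C.peek()))
      C.advance();
    Out = Token{Token::Identifier, Start.upto(C)};
    return C.remaining();
  }
  if (std::isdigit(static_cast<unsigned char>(C.peek()))) {
    while (std::isdigit(static_cast<unsigned char>(C.peek())))
      C.advance();
    Out = Token{Token::Number, Start.upto(C)};
    return C.remaining();
  }
  // Unknown character: consume it so the error token covers exactly one char.
  C.advance();
  Out = Token{Token::Error, Start.upto(C)};
  return C.remaining();
}
```

Because the function takes and returns plain string ranges, a caller can loop with `Rest = lexToken(Rest, Tok)` until `Tok.K == Token::Eof`, and the two invariants from the comment can be checked directly on any token it produces.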

arphaman added inline comments. Jun 18 2015, 9:59 AM

lib/CodeGen/MIRParser/MILexer.cpp
32–36 ↗	(On Diff #27891)	This would work for me, I'll put up an updated patch that implements a lexer using this kind of approach later today.

Updated the patch to use an approach similar to the one suggested by Sean for lexing.
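The "simple newline and character counting" that the earlier review comment proposes for mapping a lexer position back to a diagnostic location might look like the sketch below. It is an assumption-laden illustration, not part of the patch: `LineCol` and `lineAndColumn` are hypothetical names, `std::string_view` again stands in for `llvm::StringRef`, and lines and columns are counted from 1.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical 1-based source position.
struct LineCol {
  unsigned Line;
  unsigned Col;
};

// Count newlines and characters in Buffer up to (but not including) Offset.
LineCol lineAndColumn(std::string_view Buffer, std::size_t Offset) {
  assert(Offset <= Buffer.size() && "offset is outside the buffer");
  LineCol LC{1, 1};
  for (std::size_t I = 0; I < Offset; ++I) {
    if (Buffer[I] == '\n') {
      ++LC.Line;  // A newline starts the next line...
      LC.Col = 1; // ...and resets the column.
    } else {
      ++LC.Col;
    }
  }
  return LC;
}
```

A caller that knows where the instruction string starts inside the enclosing YAML buffer could add that starting offset to the token's offset before calling this helper.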

silvas added inline comments. Jun 18 2015, 5:52 PM

lib/CodeGen/MIRParser/MIParser.cpp
29 ↗	(On Diff #27955)	Is there a reason you chose is-a instead of has-a here? I don't think I've ever seen a parser implemented as a subclass of the lexer.
70 ↗	(On Diff #27955)	This lex() function doesn't seem to be buying you very much. It seems like it might as well just contain a call to a free function equivalent of what is currently the member function MILexer::lexToken. E.g. it could just be `void MIParser::lex() { CurrentSource = lexToken(CurrentSource, Token, ErrorCallback); }`.

arphaman added inline comments. Jun 18 2015, 7:16 PM

lib/CodeGen/MIRParser/MIParser.cpp
29 ↗	(On Diff #27955)	This was done so that the parser could override the error reporting method in the lexer so that the lexer could report errors. I will remove this, as I can just pass an error handler function to the lexToken method which will be called directly from the parser.
70 ↗	(On Diff #27955)	Sure, this makes sense. I'll post the updated patch tomorrow.

The updated patch removes the MILexer class and uses just a single function to lex tokens.

Overall, this LGTM. A couple small suggestions in case performance is ever a problem here.

lib/CodeGen/MIRParser/MILexer.h
58–60 ↗	(On Diff #28038)	We probably don't want to be passing a std::function here by value. Probably just use a reference to std::function or llvm::function_ref http://llvm.org/docs/ProgrammersManual.html#the-function-ref-class-template
lib/CodeGen/MIRParser/MIParser.cpp
74–76 ↗	(On Diff #28038)	Could you doublecheck that the compiler is able to optimize the materialization of the std::function here? If it isn't doing a good job, then it might be better to just construct the std::function once in the MIParser constructor and store it as a member.
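To illustrate why a non-owning callable reference is cheap to pass by value, here is a simplified stand-in written in standard C++14. It is a sketch only: `FunctionRef` and `reportFirstBadChar` are hypothetical names for this example, and the class is not LLVM's actual `llvm::function_ref` implementation. The idea it shows is that the wrapper stores just a function pointer and the address of the caller's callable, so constructing it from a lambda involves no allocation or copy of the lambda.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <type_traits>
#include <utility>

template <typename Fn> class FunctionRef;

// Non-owning reference to a callable. The referenced callable must outlive
// the FunctionRef, which is why it suits parameters but not stored members.
template <typename Ret, typename... Params>
class FunctionRef<Ret(Params...)> {
  Ret (*Callback)(std::intptr_t, Params...) = nullptr;
  std::intptr_t Obj = 0;

  // Recover the concrete callable type and forward the call to it.
  template <typename F> static Ret invoke(std::intptr_t O, Params... Ps) {
    return (*reinterpret_cast<F *>(O))(std::forward<Params>(Ps)...);
  }

public:
  // Exclude FunctionRef itself so ordinary copying is not hijacked.
  template <typename F,
            typename = std::enable_if_t<!std::is_same<
                std::remove_cv_t<std::remove_reference_t<F>>,
                FunctionRef>::value>>
  FunctionRef(F &&Fn)
      : Callback(&invoke<std::remove_reference_t<F>>),
        Obj(reinterpret_cast<std::intptr_t>(&Fn)) {}

  Ret operator()(Params... Ps) const {
    return Callback(Obj, std::forward<Params>(Ps)...);
  }
};

// Hypothetical lexer-like routine that reports the first backtick it finds.
void reportFirstBadChar(
    const std::string &Src,
    FunctionRef<void(std::size_t, const std::string &)> OnError) {
  for (std::size_t I = 0; I < Src.size(); ++I) {
    if (Src[I] == '`') {
      OnError(I, "unexpected character '`'");
      return;
    }
  }
}
```

Passing a lambda directly at the call site is safe here because the temporary lambda lives until the call returns. Storing such a reference in a class member would be a different matter, since the referenced callable could be destroyed first.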

2015-06-19 16:40 GMT-07:00 Sean Silva <chisophugis@gmail.com>:

Overall, this LGTM. A couple small suggestions in case performance is ever
a problem here.

REPOSITORY
rL LLVM
Comment at: lib/CodeGen/MIRParser/MILexer.h:58-60
@@ +57,5 @@
+/// the remaining source string.
+StringRef lexMIToken(
+ StringRef Source, MIToken &Token,
+ std::function<void(StringRef::iterator, const Twine &)> ErrorCallback);
+

We probably don't want to be passing a std::function here by value.
Probably just use a reference to std::function or llvm::function_ref
http://llvm.org/docs/ProgrammersManual.html#the-function-ref-class-template

Thanks, llvm::function_ref is perfect here.

Comment at: lib/CodeGen/MIRParser/MIParser.cpp:74-76
@@ +73,5 @@
+void MIParser::lex() {
+ CurrentSource = lexMIToken(
+ CurrentSource, Token,
+ [this](StringRef::iterator Loc, const Twine &Msg) { error(Loc, Msg); });
+}

Could you doublecheck that the compiler is able to optimize the
materialization of the std::function here? If it isn't doing a good job,
then it might be better to just construct the std::function once in the
MIParser constructor and store it as a member.

The llvm::function_ref materialization is optimized out very effectively by
the compiler during the optimized build.

http://reviews.llvm.org/D10521

Closed by commit rL240323: MIR Serialization: Introduce a lexer for machine instructions. (authored by arphaman). Jun 22 2015, 1:42 PM

This revision was automatically updated to reflect the committed changes.

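Revision Contents

Path	Size
llvm/trunk/lib/CodeGen/MIRParser/CMakeLists.txt	1 line
llvm/trunk/lib/CodeGen/MIRParser/MILexer.h	65 lines
llvm/trunk/lib/CodeGen/MIRParser/MILexer.cpp	87 lines
llvm/trunk/lib/CodeGen/MIRParser/MIParser.cpp	51 lines
llvm/trunk/test/CodeGen/MIR/X86/machine-instructions.mir	2 lines
llvm/trunk/test/CodeGen/MIR/X86/missing-instruction.mir	18 lines
llvm/trunk/test/CodeGen/MIR/X86/unrecognized-character.mir	18 lines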

Diff 28152

llvm/trunk/lib/CodeGen/MIRParser/CMakeLists.txt

	add_llvm_library(LLVMMIRParser			add_llvm_library(LLVMMIRParser
				MILexer.cpp
	MIParser.cpp			MIParser.cpp
	MIRParser.cpp			MIRParser.cpp
	)			)

	add_dependencies(LLVMMIRParser intrinsics_gen)			add_dependencies(LLVMMIRParser intrinsics_gen)

llvm/trunk/lib/CodeGen/MIRParser/MILexer.h

				//===- MILexer.h - Lexer for machine instructions -------------------------===//
				//
				// The LLVM Compiler Infrastructure
				//
				// This file is distributed under the University of Illinois Open Source
				// License. See LICENSE.TXT for details.
				//
				//===----------------------------------------------------------------------===//
				//
				// This file declares the function that lexes the machine instruction source
				// string.
				//
				//===----------------------------------------------------------------------===//

				#ifndef LLVM_LIB_CODEGEN_MIRPARSER_MILEXER_H
				#define LLVM_LIB_CODEGEN_MIRPARSER_MILEXER_H

				#include "llvm/ADT/StringRef.h"
				#include "llvm/ADT/STLExtras.h"
				#include <functional>

				namespace llvm {

				class Twine;

				/// A token produced by the machine instruction lexer.
				struct MIToken {
				enum TokenKind {
				// Markers
				Eof,
				Error,

				// Identifier tokens
				Identifier
				};

				private:
				TokenKind Kind;
				StringRef Range;

				public:
				MIToken(TokenKind Kind, StringRef Range) : Kind(Kind), Range(Range) {}

				TokenKind kind() const { return Kind; }

				bool isError() const { return Kind == Error; }

				bool is(TokenKind K) const { return Kind == K; }

				bool isNot(TokenKind K) const { return Kind != K; }

				StringRef::iterator location() const { return Range.begin(); }

				StringRef stringValue() const { return Range; }
				};

				/// Consume a single machine instruction token in the given source and return
				/// the remaining source string.
				StringRef lexMIToken(
				StringRef Source, MIToken &Token,
				function_ref<void(StringRef::iterator, const Twine &)> ErrorCallback);

				} // end namespace llvm

				#endif

llvm/trunk/lib/CodeGen/MIRParser/MILexer.cpp

				//===- MILexer.cpp - Machine instructions lexer implementation ----------===//
				//
				// The LLVM Compiler Infrastructure
				//
				// This file is distributed under the University of Illinois Open Source
				// License. See LICENSE.TXT for details.
				//
				//===----------------------------------------------------------------------===//
				//
				// This file implements the lexing of machine instructions.
				//
				//===----------------------------------------------------------------------===//

				#include "MILexer.h"
				#include "llvm/ADT/Twine.h"
				#include <cctype>

				using namespace llvm;

				namespace {

				/// This class provides a way to iterate and get characters from the source
				/// string.
				class Cursor {
				const char *Ptr;
				const char *End;

				public:
				explicit Cursor(StringRef Str) {
				Ptr = Str.data();
				End = Ptr + Str.size();
				}

				bool isEOF() const { return Ptr == End; }

				char peek() const { return isEOF() ? 0 : *Ptr; }

				void advance() { ++Ptr; }

				StringRef remaining() const { return StringRef(Ptr, End - Ptr); }

				StringRef upto(Cursor C) const {
				assert(C.Ptr >= Ptr && C.Ptr <= End);
				return StringRef(Ptr, C.Ptr - Ptr);
				}

				StringRef::iterator location() const { return Ptr; }
				};

				} // end anonymous namespace

				/// Skip the leading whitespace characters and return the updated cursor.
				static Cursor skipWhitespace(Cursor C) {
				while (isspace(C.peek()))
				C.advance();
				return C;
				}

				static bool isIdentifierChar(char C) {
				return isalpha(C) || isdigit(C) || C == '_' || C == '-' || C == '.';
				}

				static Cursor lexIdentifier(Cursor C, MIToken &Token) {
				auto Range = C;
				while (isIdentifierChar(C.peek()))
				C.advance();
				Token = MIToken(MIToken::Identifier, Range.upto(C));
				return C;
				}

				StringRef llvm::lexMIToken(
				StringRef Source, MIToken &Token,
				function_ref<void(StringRef::iterator Loc, const Twine &)> ErrorCallback) {
				auto C = skipWhitespace(Cursor(Source));
				if (C.isEOF()) {
				Token = MIToken(MIToken::Eof, C.remaining());
				return C.remaining();
				}

				auto Char = C.peek();
				if (isalpha(Char) || Char == '_')
				return lexIdentifier(C, Token).remaining();
				Token = MIToken(MIToken::Error, C.remaining());
				ErrorCallback(C.location(),
				Twine("unexpected character '") + Twine(Char) + "'");
				return C.remaining();
				}

llvm/trunk/lib/CodeGen/MIRParser/MIParser.cpp

	//===- MIParser.cpp - Machine instructions parser implementation ----------===//			//===- MIParser.cpp - Machine instructions parser implementation ----------===//
	//			//
	// The LLVM Compiler Infrastructure			// The LLVM Compiler Infrastructure
	//			//
	// This file is distributed under the University of Illinois Open Source			// This file is distributed under the University of Illinois Open Source
	// License. See LICENSE.TXT for details.			// License. See LICENSE.TXT for details.
	//			//
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
	//			//
	// This file implements the parsing of machine instructions.			// This file implements the parsing of machine instructions.
	//			//
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//

	#include "MIParser.h"			#include "MIParser.h"
				#include "MILexer.h"
	#include "llvm/ADT/StringMap.h"			#include "llvm/ADT/StringMap.h"
	#include "llvm/CodeGen/MachineBasicBlock.h"			#include "llvm/CodeGen/MachineBasicBlock.h"
	#include "llvm/CodeGen/MachineFunction.h"			#include "llvm/CodeGen/MachineFunction.h"
	#include "llvm/CodeGen/MachineInstr.h"			#include "llvm/CodeGen/MachineInstr.h"
	#include "llvm/Support/raw_ostream.h"			#include "llvm/Support/raw_ostream.h"
	#include "llvm/Support/SourceMgr.h"			#include "llvm/Support/SourceMgr.h"
	#include "llvm/Target/TargetSubtargetInfo.h"			#include "llvm/Target/TargetSubtargetInfo.h"
	#include "llvm/Target/TargetInstrInfo.h"			#include "llvm/Target/TargetInstrInfo.h"

	using namespace llvm;			using namespace llvm;

	namespace {			namespace {

	class MIParser {			class MIParser {
	SourceMgr &SM;			SourceMgr &SM;
	MachineFunction &MF;			MachineFunction &MF;
	SMDiagnostic &Error;			SMDiagnostic &Error;
	StringRef Source;			StringRef Source, CurrentSource;
				MIToken Token;
	/// Maps from instruction names to op codes.			/// Maps from instruction names to op codes.
	StringMap<unsigned> Names2InstrOpCodes;			StringMap<unsigned> Names2InstrOpCodes;

	public:			public:
	MIParser(SourceMgr &SM, MachineFunction &MF, SMDiagnostic &Error,			MIParser(SourceMgr &SM, MachineFunction &MF, SMDiagnostic &Error,
	StringRef Source);			StringRef Source);

				void lex();

	/// Report an error at the current location with the given message.			/// Report an error at the current location with the given message.
	///			///
	/// This function always return true.			/// This function always return true.
	bool error(const Twine &Msg);			bool error(const Twine &Msg);

				/// Report an error at the given location with the given message.
				///
				/// This function always return true.
				bool error(StringRef::iterator Loc, const Twine &Msg);

	MachineInstr *parse();			MachineInstr *parse();

	private:			private:
	void initNames2InstrOpCodes();			void initNames2InstrOpCodes();

	/// Try to convert an instruction name to an opcode. Return true if the			/// Try to convert an instruction name to an opcode. Return true if the
	/// instruction name is invalid.			/// instruction name is invalid.
	bool parseInstrName(StringRef InstrName, unsigned &OpCode);			bool parseInstrName(StringRef InstrName, unsigned &OpCode);

				bool parseInstruction(unsigned &OpCode);
	};			};

	} // end anonymous namespace			} // end anonymous namespace

	MIParser::MIParser(SourceMgr &SM, MachineFunction &MF, SMDiagnostic &Error,			MIParser::MIParser(SourceMgr &SM, MachineFunction &MF, SMDiagnostic &Error,
	StringRef Source)			StringRef Source)
	: SM(SM), MF(MF), Error(Error), Source(Source) {}			: SM(SM), MF(MF), Error(Error), Source(Source), CurrentSource(Source),
				Token(MIToken::Error, StringRef()) {}

				void MIParser::lex() {
				CurrentSource = lexMIToken(
				CurrentSource, Token,
				[this](StringRef::iterator Loc, const Twine &Msg) { error(Loc, Msg); });
				}

	bool MIParser::error(const Twine &Msg) {			bool MIParser::error(const Twine &Msg) { return error(Token.location(), Msg); }

				bool MIParser::error(StringRef::iterator Loc, const Twine &Msg) {
	// TODO: Get the proper location in the MIR file, not just a location inside			// TODO: Get the proper location in the MIR file, not just a location inside
	// the string.			// the string.
	Error =			assert(Loc >= Source.data() && Loc <= (Source.data() + Source.size()));
	SMDiagnostic(SM, SMLoc(), SM.getMemoryBuffer(SM.getMainFileID())			Error = SMDiagnostic(
	->getBufferIdentifier(),			SM, SMLoc(),
	1, 0, SourceMgr::DK_Error, Msg.str(), Source, None, None);			SM.getMemoryBuffer(SM.getMainFileID())->getBufferIdentifier(), 1,
				Loc - Source.data(), SourceMgr::DK_Error, Msg.str(), Source, None, None);
	return true;			return true;
	}			}

	MachineInstr *MIParser::parse() {			MachineInstr *MIParser::parse() {
	StringRef InstrName = Source;			lex();

	unsigned OpCode;			unsigned OpCode;
	if (parseInstrName(InstrName, OpCode)) {			if (Token.isError() || parseInstruction(OpCode))
	error(Twine("unknown machine instruction name '") + InstrName + "'");
	return nullptr;			return nullptr;
	}

	// TODO: Parse the rest of instruction - machine operands, etc.			// TODO: Parse the rest of instruction - machine operands, etc.
	const auto &MCID = MF.getSubtarget().getInstrInfo()->get(OpCode);			const auto &MCID = MF.getSubtarget().getInstrInfo()->get(OpCode);
	auto *MI = MF.CreateMachineInstr(MCID, DebugLoc());			auto *MI = MF.CreateMachineInstr(MCID, DebugLoc());
	return MI;			return MI;
	}			}

				bool MIParser::parseInstruction(unsigned &OpCode) {
				if (Token.isNot(MIToken::Identifier))
				return error("expected a machine instruction");
				StringRef InstrName = Token.stringValue();
				if (parseInstrName(InstrName, OpCode))
				return error(Twine("unknown machine instruction name '") + InstrName + "'");
				return false;
				}

	void MIParser::initNames2InstrOpCodes() {			void MIParser::initNames2InstrOpCodes() {
	if (!Names2InstrOpCodes.empty())			if (!Names2InstrOpCodes.empty())
	return;			return;
	const auto *TII = MF.getSubtarget().getInstrInfo();			const auto *TII = MF.getSubtarget().getInstrInfo();
	assert(TII && "Expected target instruction info");			assert(TII && "Expected target instruction info");
	for (unsigned I = 0, E = TII->getNumOpcodes(); I < E; ++I)			for (unsigned I = 0, E = TII->getNumOpcodes(); I < E; ++I)
	Names2InstrOpCodes.insert(std::make_pair(StringRef(TII->getName(I)), I));			Names2InstrOpCodes.insert(std::make_pair(StringRef(TII->getName(I)), I));
	}			}
	[14 lines of unchanged context not shown]

llvm/trunk/test/CodeGen/MIR/X86/machine-instructions.mir

	[14 lines of unchanged context not shown]
	# CHECK: name: inc			# CHECK: name: inc
	name: inc			name: inc
	body:			body:
	- name: entry			- name: entry
	instructions:			instructions:
	# CHECK: - IMUL32rri8			# CHECK: - IMUL32rri8
	# CHECK-NEXT: - RETQ			# CHECK-NEXT: - RETQ
	- IMUL32rri8			- IMUL32rri8
	- RETQ			- ' RETQ '
	...			...

llvm/trunk/test/CodeGen/MIR/X86/missing-instruction.mir

				# RUN: not llc -march=x86-64 -start-after branch-folder -stop-after branch-folder -o /dev/null %s 2>&1 | FileCheck %s

				--- |

				define void @foo() {
				entry:
				ret void
				}

				...
				---
				name: foo
				body:
				- name: entry
				instructions:
				# CHECK: 1:1: expected a machine instruction
				- ''
				...

llvm/trunk/test/CodeGen/MIR/X86/unrecognized-character.mir

				# RUN: not llc -march=x86-64 -start-after branch-folder -stop-after branch-folder -o /dev/null %s 2>&1 | FileCheck %s

				--- |

				define void @foo() {
				entry:
				ret void
				}

				...
				---
				name: foo
				body:
				- name: entry
				instructions:
				# CHECK: 1:1: unexpected character '`'
				- '` RETQ'
				...