This is an archive of the discontinued LLVM Phabricator instance.

MIR Parsing: Introduce a MI Lexing class.
ClosedPublic

Authored by arphaman on Jun 17 2015, 4:27 PM.

Details

Summary

This patch is based on a patch that serializes machine instruction names (http://reviews.llvm.org/D10481).

This patch adds a MILexer class that complements the MIParser class from the previous patch. It still only allows to serialized machine instruction names, but this time it performs tokenization of the source string to allow the MIParser to progress to parsing of machine operands.

Diff Detail

Repository
rL LLVM

Event Timeline

arphaman updated this revision to Diff 27891.Jun 17 2015, 4:27 PM
arphaman retitled this revision from to MIR Parsing: Introduce a MI Lexing class..
arphaman updated this object.
arphaman edited the test plan for this revision. (Show Details)
arphaman added reviewers: dexonsmith, bob.wilson, bogner.
arphaman set the repository for this revision to rL LLVM.
arphaman added a subscriber: Unknown Object (MLST).
silvas added a subscriber: silvas.Jun 17 2015, 10:09 PM
silvas added inline comments.
lib/CodeGen/MIRParser/MILexer.cpp
32–36 ↗(On Diff #27891)

Since we ultimately want to associate this back to the yaml file, using SourceMgr here doesn't seem like a very good fit. It sounds a lot more convenient to have the lexer just be StringRef based, then we can just add some simple newline and character counting to issue a custom diagnostic back in the YAML (or reuse SourceMgr just for that part, outside the core lexing).

In the past I have found this core lexing interface to be useful (supplemented by a sugar "Lexer" class in the header):

struct Token {
  ... Kind (which can be an error), etc. ...
  StringRef Range;
};
StringRef lex(StringRef Range, Token &OutTok);
// (or lexImpl or whatever).

The return value is a new range, a suffix of the old range, containing the remaining yet-to-be-lexed characters.
Some useful invariants to maintain are:

Token Tok, OtherTok;
StringRef R = ...., NewR;
NewR = lex(R, Tok);
assert(lex(Tok.Range, OtherTok).size() == 0); // Entire range of a valid token is consumed.
assert(Tok == OtherTok); // The exact token is recovered by re-lexing.

A convenient Lexer class can then be trivially built around this in the header (alongside token kind definitions and such). Everything else is in the .cpp file and nicely decoupled.

When I've used this in the past, the first thing to do in lexImpl is to stuff the incoming stringref into a trivial "Cursor" class that has a .peek() method which checks for EOF and returns '\0', otherwise the char at the cursor. This eliminates *a lot* of repeated "!isEOF() && is...." checks (your patch already has two of them: "while (!isEOF() && isspace(*CurPtr))" and "while (!isEOF() && isIdentifierChar(*CurPtr))"); these can then be written e.g. while (isspace(C.peek()). The main body of lexImpl then becomes something like:

StringRef lexImpl(StringRef R, Token &OutTok) {
  Cursor C(R);
  skipWhitespace(C);
  if (C.isEOF())
    return StringRef();
  if (Cursor RC = maybeLexDelimiter(C, OutTok))
    return RC.remaining();
  if (Cursor RC = maybeLexIdentifier(C, OutTok))
    return RC.remaining();
  if (Cursor RC = maybeLexNumber(C, OutTok))
    return RC.remaining();
  ....
}
arphaman added inline comments.Jun 18 2015, 9:59 AM
lib/CodeGen/MIRParser/MILexer.cpp
32–36 ↗(On Diff #27891)

This would work for me, I'll put up an updated patch that implements a lexer using this kind of approach later today.

arphaman updated this revision to Diff 27955.Jun 18 2015, 1:10 PM

Updated the patch to use an approach similar to the one suggested by Sean for lexing.

silvas added inline comments.Jun 18 2015, 5:52 PM
lib/CodeGen/MIRParser/MIParser.cpp
29 ↗(On Diff #27955)

Is there a reason you chose is-a instead of has-a here? I don't think I've ever seen a parser implemented as a subclass of the lexer.

70 ↗(On Diff #27955)

This lex() function don't seem to be buying you very much. It seems like it might as well just contain a call to a free function equivalent of what is currently the member function MILexer::lexToken. E.g. it could just be void MIParser::lex() { CurrentSource = lexToken(CurrentSource, Token, ErrorCallback); }.

arphaman added inline comments.Jun 18 2015, 7:16 PM
lib/CodeGen/MIRParser/MIParser.cpp
29 ↗(On Diff #27955)

This was done so that the parser could override the error reporting method in the lexer so that the lexer could report errors. I will remove this, as I can just pass an error handler function to the lexToken method which will be called directly from the parser.

70 ↗(On Diff #27955)

Sure, this makes sense. I'll post the updated patch tomorrow.

arphaman updated this revision to Diff 28038.Jun 19 2015, 12:54 PM

The updated patch removes the MILexer class and uses just a single function to lex tokens.

Overall, this LGTM. A couple small suggestions in case performance is ever a problem here.

lib/CodeGen/MIRParser/MILexer.h
58–60 ↗(On Diff #28038)

We probably don't want to be passing a std::function here by value. Probably just use a reference to std::function or llvm::function_ref http://llvm.org/docs/ProgrammersManual.html#the-function-ref-class-template

lib/CodeGen/MIRParser/MIParser.cpp
74–76 ↗(On Diff #28038)

Could you doublecheck that the compiler is able to optimize the materialization of the std::function here? If it isn't doing a good job, then it might be better to just construct the std::function once in the MIParser constructor and store it as a member.

2015-06-19 16:40 GMT-07:00 Sean Silva <chisophugis@gmail.com>:

Overall, this LGTM. A couple small suggestions in case performance is ever
a problem here.

REPOSITORY

rL LLVM

Comment at: lib/CodeGen/MIRParser/MILexer.h:58-60
@@ +57,5 @@
+/// the remaining source string.
+StringRef lexMIToken(
+ StringRef Source, MIToken &Token,
+ std::function<void(StringRef::iterator, const Twine &)>
ErrorCallback);

+

We probably don't want to be passing a std::function here by value.
Probably just use a reference to std::function or llvm::function_ref
http://llvm.org/docs/ProgrammersManual.html#the-function-ref-class-template

Thanks, llvm::function_ref is perfect here.

Comment at: lib/CodeGen/MIRParser/MIParser.cpp:74-76
@@ +73,5 @@
+void MIParser::lex() {
+ CurrentSource = lexMIToken(
+ CurrentSource, Token,
+ [this](StringRef::iterator Loc, const Twine &Msg) { error(Loc,
Msg); });

+}

Could you doublecheck that the compiler is able to optimize the
materialization of the std::function here? If it isn't doing a good job,
then it might be better to just construct the std::function once in the
MIParser constructor and store it as a member.

The llvm::function_ref materialization is optimized out very effectively by
the compiler during the optimized build.

http://reviews.llvm.org/D10521

EMAIL PREFERENCES

http://reviews.llvm.org/settings/panel/emailpreferences/
This revision was automatically updated to reflect the committed changes.