This is an archive of the discontinued LLVM Phabricator instance.

[AsmParser] Allow tokens to be put back in to the token stream.
ClosedPublic

Authored by colinl on Nov 2 2015, 12:05 PM.

Details

Summary

This patch modifies the AsmParser to allow tokens to be injected back into the token stream. This is useful for parsing some grammars that otherwise confuse the lexer.

Examples:
"v31:30.h" The sequence seen by the lexer is Token "v31", Colon ":", Double "30.", Token ".h". It's easier to change the tokens to Token "v31:30", Token ".h".

"memw(r0 << #2 + #60)" This sequence confuses the expression parser when parsing "2 + #". The plus looks like an infix addition, but it's actually part of the instruction text.

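As a rough illustration of the first example above, the following is a minimal sketch of folding three tokens into one and putting the result back. It is not the patch's code: Kind, Tok, TokenStream, putBack, foldRegisterPair and foldAndDrain are hypothetical names, and the token kinds are simplified stand-ins for what the real lexer produces.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified token type; not LLVM's AsmToken.
enum class Kind { Identifier, Colon, Real, Eof };

struct Tok {
  Kind kind;
  std::string text;
};

// Toy token stream: yields pre-lexed tokens, but serves put-back tokens first.
class TokenStream {
public:
  explicit TokenStream(std::vector<Tok> toks) : source_(std::move(toks)) {}

  Tok next() {
    if (!pending_.empty()) {
      Tok t = pending_.front();
      pending_.pop_front();
      return t;
    }
    if (pos_ < source_.size())
      return source_[pos_++];
    return Tok{Kind::Eof, ""};
  }

  // Put a token back so that the following next() call returns it.
  void putBack(Tok t) { pending_.push_front(std::move(t)); }

private:
  std::vector<Tok> source_;
  std::size_t pos_ = 0;
  std::deque<Tok> pending_;
};

// Tries to fold Identifier ':' Real into a single Identifier token,
// e.g. "v31" ":" "30." becomes "v31:30". On a mismatch the three tokens
// are put back in reverse order, so the stream is left as it was.
bool foldRegisterPair(TokenStream &s) {
  Tok a = s.next();
  Tok b = s.next();
  Tok c = s.next();
  if (a.kind == Kind::Identifier && b.kind == Kind::Colon &&
      c.kind == Kind::Real) {
    std::string hi = c.text;
    if (!hi.empty() && hi.back() == '.')
      hi.pop_back(); // drop the trailing dot the lexer attached to "30."
    s.putBack(Tok{Kind::Identifier, a.text + ":" + hi});
    return true;
  }
  s.putBack(c);
  s.putBack(b);
  s.putBack(a);
  return false;
}

// Runs the fold once, then collects the text of every remaining token.
std::vector<std::string> foldAndDrain(std::vector<Tok> toks) {
  TokenStream s(std::move(toks));
  foldRegisterPair(s);
  std::vector<std::string> out;
  for (Tok t = s.next(); t.kind != Kind::Eof; t = s.next())
    out.push_back(t.text);
  return out;
}
```

Feeding foldAndDrain the four tokens from the example leaves two tokens in the stream, the merged register pair followed by ".h". A sequence that does not match the pattern comes back unchanged.
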
Diff Detail

Repository
rL LLVM

Event Timeline

colinl updated this revision to Diff 38959. Nov 2 2015, 12:05 PM
colinl retitled this revision to [AsmParser] Allow tokens to be put back in to the token stream.
colinl updated this object.
colinl added reviewers: mcrosier, sidneym.
colinl set the repository for this revision to rL LLVM.
colinl added a subscriber: llvm-commits.

Although a data structure like a deque or a ring buffer would more directly represent what we're trying to do, a SmallVector was chosen because some backends do the following sequence:

AsmToken const &Token = Lexer.getTok();
useToken(Token);
Lexer.Lex();
useToken(Token);

Since the number of tokens in the buffer is going to be small, usually 1, a SmallVector with pushing to head should be fine.
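The buffer described above might look roughly like the sketch below. This is an assumption-laden miniature, not the committed code: MiniLexer, lex, unLex and demoBuffer are hypothetical names, std::vector stands in for llvm::SmallVector, and tokens are plain strings. The sketch re-reads the current token after every call instead of holding a reference across lex(), so it does not model the reference-lifetime concern from the backend sequence.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical miniature lexer. The front of buffer_ is always the current
// token; put-back tokens are inserted at the head.
class MiniLexer {
public:
  explicit MiniLexer(std::vector<std::string> words)
      : words_(std::move(words)) {
    buffer_.push_back(lexFromSource()); // prime the current token
  }

  const std::string &getTok() const { return buffer_.front(); }

  // Advance: drop the current token and refill from the source only when
  // the buffer has run dry.
  const std::string &lex() {
    assert(!buffer_.empty());
    buffer_.erase(buffer_.begin());
    if (buffer_.empty())
      buffer_.push_back(lexFromSource());
    return buffer_.front();
  }

  // Put a token back: it becomes the current token, and the previous
  // current token is returned by the following lex().
  void unLex(std::string tok) {
    buffer_.insert(buffer_.begin(), std::move(tok));
  }

private:
  std::string lexFromSource() {
    return pos_ < words_.size() ? words_[pos_++] : std::string("<eof>");
  }

  std::vector<std::string> words_;
  std::size_t pos_ = 0;
  std::vector<std::string> buffer_;
};

// Records the current token after each step of a short lex/unLex sequence.
std::vector<std::string> demoBuffer() {
  std::vector<std::string> words = {"a", "b", "c"};
  MiniLexer lexer(words);
  std::vector<std::string> seen;
  seen.push_back(lexer.getTok()); // first token from the source
  lexer.lex();
  seen.push_back(lexer.getTok()); // second token from the source
  lexer.unLex("x");
  seen.push_back(lexer.getTok()); // the put-back token is now current
  lexer.lex();
  seen.push_back(lexer.getTok()); // back to the token that was displaced
  lexer.lex();
  seen.push_back(lexer.getTok()); // buffer was empty, so the source is read
  return seen;
}
```

Inserting at the head of a vector is linear in the buffer size, which is acceptable here only because the buffer is expected to hold one token most of the time.
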

There should be no functional change.

I noticed some backends have copy/pasted the built-in expression parser for reasons similar to those above; they could possibly make use of this change.

This revision was automatically updated to reflect the committed changes.

What are the alternatives? This is a pretty nasty hack.

With the expression parsing example failing on "+ #" it seemed like the only alternative was to copy/paste the built-in expression parser and modify it to deal with this situation.

Another instance where this was helpful is in HexagonAsmParser::ParseInstruction. Previously we only got the string for the first token in a statement. This is because the first token is usually a mnemonic string, which isn't true here. One option would be to modify AsmParser so that it does not consume the token, but a bunch of the logic around labels, directives, etc. assumes the first token was consumed.

A lot of this is because different ASM grammars lex differently; I'm not sure having one lexer/parser for everything is sustainable. I'm open to ideas.

Would it be reasonable to instead add target hooks for things like this? Either specialized things like the comment character handling or something more general like the ability to override recognizing an integer literal or whatever.

I think that's a more ideal situation. In related commits I did add the ability for targets to specify additional tokens: http://reviews.llvm.org/D14256

It seems like if things get much more complicated, we're going to need a real, target-specified parser and lexer produced by parser generator tools.

I'll keep looking at it, though; maybe I can move things around and rewrite them in terms of peekTokens. Then again, peekTokens could be rewritten in terms of this.
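
To make the last point concrete, here is a minimal sketch of how a lookahead could be built from nothing but "read a token" and "put a token back". It is only an illustration under assumptions: Stream, next, putBack, peek and demoPeek are hypothetical names, tokens are plain strings, and an empty string marks the end of input. It is not the signature or behaviour of the real peekTokens.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stream with only two primitives: next() and putBack().
class Stream {
public:
  explicit Stream(std::vector<std::string> words) : words_(std::move(words)) {}

  // Returns the next token, or an empty string at the end of input.
  std::string next() {
    if (!pending_.empty()) {
      std::string t = std::move(pending_.back());
      pending_.pop_back();
      return t;
    }
    return pos_ < words_.size() ? words_[pos_++] : std::string();
  }

  // The token put back most recently is the first one returned by next().
  void putBack(std::string t) { pending_.push_back(std::move(t)); }

private:
  std::vector<std::string> words_;
  std::size_t pos_ = 0;
  std::vector<std::string> pending_; // back() is returned first
};

// Looks ahead n tokens by consuming them and then putting them back in
// reverse order, so the stream ends up where it started.
std::vector<std::string> peek(Stream &s, std::size_t n) {
  std::vector<std::string> ahead;
  for (std::size_t i = 0; i < n; ++i)
    ahead.push_back(s.next());
  for (auto it = ahead.rbegin(); it != ahead.rend(); ++it)
    s.putBack(*it);
  return ahead;
}

// Peeks two tokens, then consumes three, recording everything in order.
std::vector<std::string> demoPeek() {
  std::vector<std::string> words = {"memw", "(", "r0"};
  Stream s(words);
  std::vector<std::string> out = peek(s, 2);
  for (int i = 0; i < 3; ++i)
    out.push_back(s.next());
  return out;
}
```

The two peeked tokens show up again when the stream is consumed, which is the property a lookahead needs. Whether this direction or the reverse (put-back built on peeking) is cleaner in AsmParser is the open question in the comment above.
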