This is an archive of the discontinued LLVM Phabricator instance.

[AsmParser] Backends can parameterize ASM tokenization.
AbandonedPublic

Authored by colinl on Nov 2 2015, 12:12 PM.

Details

Summary

This allows backends to specify how to lex their instructions through variables in the .TD files.

Several backends need to do special things with the dot character and Hexagon contains numerous additional characters that need to be tokenized #()=:.<>!+*.

Diff Detail

Event Timeline

colinl updated this revision to Diff 38962.Nov 2 2015, 12:12 PM
colinl retitled this revision from to [AsmParser] Backends can parameterize ASM tokenization..
colinl updated this object.
colinl added reviewers: sidneym, mcrosier.
colinl set the repository for this revision to rL LLVM.
colinl added a subscriber: llvm-commits.
colinl abandoned this revision.Nov 8 2015, 8:15 PM

252439

And here Colin is yet another commit you have landed without review.

You seem to have been landing multiple MC changes without getting any
review or comment from anyone else working on MC despite mailing them out
for pre-commit review. What on earth is going on here?

Looking at the diff, this also seems to effectively revert multiple changes
made to the AsmParser code in the last few months. I assume you're trying
to upstream patches from some internal branch, but you *really must*
actually merge the trunk changes in!

Yet again, please revert this, and every other patch dependent on it if
necessary, and every other patch you are landing outside of Hexagon without
bothering to respect our code review process.

-Chandler

I see the size_t loop variable was mistakenly changed to an unsigned.

Were there other reverted changes I should make note of? My hope was for this patch to result in no functional change by not changing any tests and ensuring the all passed after the patch.

craig.topper edited edge metadata.Jan 4 2016, 11:05 AM
craig.topper added a subscriber: craig.topper.

It also changed from StringRef::find back to std::find

echristo added a subscriber: echristo.

How do you see this as working. The changes you made to the existing ports don't show how this would be useful.

What's going on here? I think there have been enough concerns over this patch set that you might want to temporarily revert and send out a design doc for what you want to change.

-eric

There is a review for the revert: D15874.

The change motivation is because TableGen and the AsmParser recognize different token sets which caused them to tokenize in fundamentally different ways. Before this change if TableGen was presented with "if(p0) r0 = r0" it would tokenize it as { "if(p0)", "r0", "=", "r0" } and the AsmParser would tokenize it as { "if", "(", "p0", ")", "r0", "=', "r0" } Before this patch the set of tokens that are part of identifiers was fixed by the tokenizeAsmString switch table shared by all targets. The first inclination is to add the parenthesis tokens to the switch table but this causes issues in other targets, for instance in X86 the ST(0) register has a hard time being parsed with this change which means different targets have mutually exclusive tokenization rules hence the allowance for targets to specify their tokens.

The existing tokenizeAsmString did three different variants of tokenization. It discarded certain separator characters "a,b" -> { "a", "b" } It would tokenize certain characters "a$b" -> { "a", "$", "b" } and it did a weird one-off break of the dot character but only sometimes and would concatenate it to the following identifier "a.b" -> { "a", ".b" } The change allowed these to be parameterized by each target, something we were already kind of doing for the dot break case with "MnemonicContainsDot" which this patch replaced and something that needed to be done for the mutually exclusive parsing rules.

Merge mistakes aside I haven't heard a proposal or volunteer to make larger design changes to the assembly parser.