This is an archive of the discontinued LLVM Phabricator instance.

[AsmParser] Generalize matching for grammars without mnemonic-lead statements.
ClosedPublic

Authored by colinl on Nov 2 2015, 12:15 PM.

Details

Summary

Right now one of the assumptions from tablegen and the asm matcher is that instruction statements are led by a mnemonic. Hexagon doesn't follow this assumption, and even trivial instructions like the register transfer

r0 = r1

break tablegen when generating match tables.

Diff Detail

Repository
rL LLVM

Event Timeline

colinl updated this revision to Diff 38963.Nov 2 2015, 12:15 PM
colinl retitled this revision from to [AsmParser] Generalize matching for grammars without mnemonic-lead statements..
colinl updated this object.
colinl added reviewers: sidneym, mcrosier.
colinl set the repository for this revision to rL LLVM.
colinl added a subscriber: llvm-commits.

This change will increase the size of the match table Classes array by 1 entry per instruction.

Future work could be done to remove the mnemonic entry from MatchEntry and instead use the match class in the Classes array.

This revision was automatically updated to reflect the committed changes.
craig.topper edited edge metadata.Dec 29 2015, 8:19 PM

I know this revision went in a while ago, but this seems like a really terribly performing option for Hexagon. It does a linear scan through the entire match table (which at the time of this writing is 2160 entries) for any instruction that starts with a non-token. There appear to be about ~1500 such mnemonic-less entries in the table, so about 3/4 of the entire table.

I don't care when it went in, Craig; it went in without any review at all.

Colin, please revert this immediately and ping the thread for a proper
review. You've been around long enough to know that this kind of behavior
(posting for review, clearly having concerns about it, and then landing
without any review happening) is completely unacceptable.

Hexagon tests have become dependent on this, and I'm not sure how feasible it is to revert the later dependencies. I've added a flag in r256669 to at least recover the binary size on all the other targets.

While unfortunate, since Colin decided to commit this without review, he
and the Hexagon team will have to bear the cost of reverting every
dependency on this behavior that they committed without any review or
approval from the community for the requisite change to the broader LLVM
infrastructure.

Hey guys, I'm sorry about any issues that were created by not waiting for review on these patches; I'll definitely work through whatever we need to clean things up.

It looks like there are ~15 commits in AsmMatcherEmitter that build on these patches in various ways, so I'll have to step carefully with the revert.

Hey everyone, I have a revert up for review which undoes the changes in this series of three patches: http://reviews.llvm.org/D15874

In the meantime, here's an expanded rationale for the changes that were made.

A couple of regressions:
size_t to unsigned and string::find to StringRef::find, due to merge issues. Even though I rebased before committing, the noise from trying to maintain 80-column formatting, both in the source and in the generated code, complicated the merge, and these two issues were missed. My apologies; I'll try to be more careful.

MatchTable size:
The match table is an N x M array of match entries, where N is the number of instructions and M is the number of operands of the instruction with the largest number of operands in the entire instruction set. Instructions that use fewer operands than the longest instruction have their remaining slots value-initialized as unspecified array elements. The issue is that if the number of operands for the longest instruction changes, the entire array is resized to N x (M + 1). By adding the initial mnemonic to the match table we hit this case, which increased the binary size of the tools themselves.
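To make the resize cost concrete, here is a minimal, self-contained sketch of the row layout; the field names and widths are assumptions modeled loosely on the generated <Target>GenAsmMatcher.inc tables, not the actual LLVM definitions.

#include <cstdint>
#include <cstdio>

constexpr unsigned MaxNumOperands = 5; // widest instruction in the set

struct MatchEntry {
  uint16_t Mnemonic;                // offset into the mnemonic string table
  uint16_t Opcode;
  uint8_t  ConvertFn;
  uint8_t  Classes[MaxNumOperands]; // unused slots are value-initialized
};

int main() {
  // Treating the leading mnemonic as operand zero effectively raises
  // MaxNumOperands by one, so every one of the N rows gains a slot.
  std::printf("bytes per row at M=%u: %zu\n", MaxNumOperands,
              sizeof(MatchEntry));
  return 0;
}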

r256669 was created to address this issue.
An alternative I would have suggested in a code review for this patch would be to remove Mnemonic from OperandMatchEntry and rewrite the binary search that uses LessOpcodeOperand to use OperandMatchEntry.Classes[0]. This should address the binary size issue without introducing the HasMnemonicFirst flag, while still allowing instructions without a leading mnemonic.
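Here is a hedged sketch of that alternative: sort the operand-match table by the first match class and binary-search on that, so no Mnemonic field is needed. The struct and comparator below are stand-ins; the real generated types differ.

#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

struct OperandMatchEntry {
  uint16_t Opcode;
  uint8_t Classes[4]; // Classes[0] holds the leading operand's class
};

struct LessByLeadingClass {
  bool operator()(const OperandMatchEntry &L, uint8_t C) const {
    return L.Classes[0] < C;
  }
  bool operator()(uint8_t C, const OperandMatchEntry &R) const {
    return C < R.Classes[0];
  }
};

// Narrow a table sorted by Classes[0] to the entries whose leading class
// matches the first parsed operand.
std::pair<const OperandMatchEntry *, const OperandMatchEntry *>
narrowByLeadingClass(const std::vector<OperandMatchEntry> &Table,
                     uint8_t LeadClass) {
  auto R = std::equal_range(Table.data(), Table.data() + Table.size(),
                            LeadClass, LessByLeadingClass());
  return {R.first, R.second};
}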

matchTokenString function size
This function converts known strings to an enum value via a nested switch table. By adding mnemonics to the generated match set we're indeed increasing the size of this function. If we removed OperandMatchEntry.Mnemonic and used Classes[0] we'd be making use of these code paths; otherwise, it's true we could optimize this and only add strings that are actually needed for targets that have HasMnemonicFirst=0.
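For readers unfamiliar with the generated code, this is roughly the shape of matchTokenString: a switch on token length, then on characters. The match class names below are invented for the example; the real generated function covers far more tokens.

#include <string>

enum MatchClassKind { InvalidMatchClass = 0, MCK_r0, MCK_r1 };

static MatchClassKind matchTokenString(const std::string &Name) {
  switch (Name.size()) {
  default:
    break;
  case 2: // e.g. the register tokens "r0" and "r1"
    if (Name[0] == 'r') {
      switch (Name[1]) {
      case '0': return MCK_r0;
      case '1': return MCK_r1;
      }
    }
    break;
  }
  return InvalidMatchClass;
}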

Linear search
Right now the match loop narrows down the match table by the first entry, which is assumed to be a string, and does a linear search within these entries to find a match. When expanding to a generalized search, the first impulse is to try to perform a binary search on matches, though there are some issues.
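As a compact model of the current behavior (assumed shapes, not the generated code): the table is sorted by mnemonic, equal_range finds the slice for that mnemonic, and only the slice is scanned linearly. With no leading mnemonic, the slice degenerates to most of the table.

#include <algorithm>
#include <string>
#include <vector>

struct Entry {
  std::string Mnemonic;
  unsigned Opcode;
};

struct LessMnemonic {
  bool operator()(const Entry &E, const std::string &M) const {
    return E.Mnemonic < M;
  }
  bool operator()(const std::string &M, const Entry &E) const {
    return M < E.Mnemonic;
  }
};

unsigned matchFirst(const std::vector<Entry> &Table,
                    const std::string &Mnemonic,
                    bool (*operandsFit)(const Entry &)) {
  auto Range =
      std::equal_range(Table.begin(), Table.end(), Mnemonic, LessMnemonic());
  for (auto It = Range.first; It != Range.second; ++It)
    if (operandsFit(*It)) // linear scan, but only inside the slice
      return It->Opcode;
  return 0; // stands in for Match_MnemonicFail
}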

Some backends don't allow certain combinations of operand-fitting tests to be performed; they have asserts in branches that aren't hit with a linear search but are hit with a binary search. See ARMAsmParser::checkTargetMatchPredicate, which has asserts when certain checks don't pass, despite the fact that the matcher loop is designed to survive match predicate failures with "if ((MatchResult = checkTargetMatchPredicate(Inst)) != Match_Success)".

Some backends test specific error messages out of the assembler; see noneon-diagnostics.s. Since the matcher loop tracks the last instruction tested and sometimes prints diagnostics for that instruction, these tests would need to be changed.

The fundamental issue with binary searching the match entries is that we can't generate a total order for all operand fits. Consider the register classes
ClassA{reg1, reg2}, ClassB{reg1, reg3}, ClassC{reg2, reg3}: reg1 matches A and B, reg2 matches A and C, and reg3 matches B and C, so all three pairs of classes would need to be adjacent, but a linear order of three classes has only two adjacent pairs. There's no way to sort these register classes such that valid entries are in a contiguous region, which makes binary searching not work. I think the ARM backend had register classes in this category that tripped up the matcher. Also consider the immediate classes s8, s16, u8, u16 when trying to match -1 and 257; here too there's no way to sort the immediate classes such that there's a contiguous match region.
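The register-class case can be checked exhaustively; this self-contained brute-force sketch (not LLVM code) confirms that no permutation of (A, B, C) keeps every register's set of matching classes contiguous.

#include <algorithm>
#include <array>
#include <cstdio>
#include <cstdlib>

int main() {
  // matches[r] lists the classes (0=A, 1=B, 2=C) that accept register r.
  const std::array<std::array<int, 2>, 3> matches = {{
      {{0, 1}}, // reg1: A, B
      {{0, 2}}, // reg2: A, C
      {{1, 2}}, // reg3: B, C
  }};
  std::array<int, 3> perm = {0, 1, 2};
  do {
    std::array<int, 3> pos{};
    for (int i = 0; i < 3; ++i)
      pos[perm[i]] = i; // position of each class in this ordering
    bool ok = true;
    for (const auto &m : matches)
      ok = ok && std::abs(pos[m[0]] - pos[m[1]]) == 1; // adjacent?
    if (ok)
      std::printf("valid order found\n"); // never reached
  } while (std::next_permutation(perm.begin(), perm.end()));
  std::printf("no ordering keeps all match sets contiguous\n");
  return 0;
}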

If someone can propose an efficient way to do this searching I'd be interested; it'd be even more interesting if it didn't involve rewriting all the assembler backends.

I'm wondering whether it's a good idea to revert the changes that added the new ways to break tokens. Clearly many targets had to adapt to that since you disconnected MnemonicContainsDot. There's at least one out-of-tree target that was using that flag. Reverting will create churn on these out-of-tree targets. I don't know that I have particularly big concerns with that change, so we may be able to leave that in. Chandler, thoughts?

As for the change to the matching tables: I think the big issue here is that you're trying to generalize a design that fundamentally only scales reasonably well if it can rely on the binary search for the initial mnemonic to reduce the linear-scan search space.

Does the Hexagon assembly language have a defined grammar that would lend
itself to a different kind of parser? Presumably for most instructions the
mnemonic is in the line somewhere? Can we come up with a way to locate this
mnemonic in the line and key the matching to that?

colinl added a comment.Jan 4 2016, 5:12 PM

I agree about the scalability of the match table. Part of the goal with this design was to impact other targets as little as possible; if a design required a rewrite of all targets, it would probably have been a non-starter as far as getting Hexagon parsing working. The change preserved the existing behavior of a string match plus a small linear search if the first actual parsed operand was a string. This was to have no performance impact for matching on existing targets.

Was the majority of the issue with these changes the tool code-size increase? I obviously miscalculated the impact, since I thought the effect on size would be minimal and it would only affect machines that are running the assembler, presumably machines with sufficient resources to deal with one extra operand.

The Hexagon instructions can be parsed easily with a parser generator that accepts EBNF-style definitions. The biggest issue is that we have separate parsing and matching loops: we do one pass through the statement to parse tokens into operands and then a second pass to match operands to instructions.

Fixing this would require either slicing these loops differently or overhauling the matcher with something that generates more generalized matching grammars. There's a lot of code in various targets dedicated to pasting tokens together or splitting tokens apart, so making this type of change seems like it would require all targets to transition, which is what I was trying to avoid.
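To illustrate the separation, here is a toy model of the two-phase pipeline (entirely illustrative, no LLVM types): phase one lexes the statement into a flat operand list with no grammar knowledge, and only phase two consults the instruction descriptions; a grammar-driven parser would fuse these phases instead.

#include <cctype>
#include <string>
#include <vector>

std::vector<std::string> parsePhase(const std::string &Stmt) {
  std::vector<std::string> Operands;
  std::string Cur;
  for (char C : Stmt) {
    if (std::isalnum(static_cast<unsigned char>(C)) || C == '.') {
      Cur += C; // accumulate identifier-like characters
    } else {
      if (!Cur.empty()) {
        Operands.push_back(Cur);
        Cur.clear();
      }
      if (!std::isspace(static_cast<unsigned char>(C)))
        Operands.push_back(std::string(1, C)); // punctuation is its own token
    }
  }
  if (!Cur.empty())
    Operands.push_back(Cur);
  return Operands;
}

bool matchPhase(const std::vector<std::string> &Operands) {
  // Only here does anything see the whole statement at once; the parser
  // has already committed to one tokenization by this point.
  return !Operands.empty();
}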

Let me know your thoughts, I'm open to suggestions.

colinl added a comment.Jan 5 2016, 8:34 AM

Most but not all instructions have mnemonics, e.g. "r0 = r1". Lots have what appears to be a mnemonic but actually isn't: in "if (p0) r0 = add(r1, r2)" a large swath of instructions start with 'if', but that's not actually the mnemonic. The version of the parser we had internally before this one did as you suggested: it searched for a string token in the operand list, added a dummy operand to the end, and sliced the head operands off and put them after the dummy value. The asm parser did a similar thing, and once matched, it would go back through the MCInst and reorder the operands to the correct position.

This required what I thought was an extremely nasty hack to AsmMatcherEmitter and confusing code in the parser, so I rewrote it into the current version contained in these patches.
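For reference, here is a toy reconstruction of that reordering hack (names invented, details simplified): find the first string token, append a dummy marker, and rotate so the token leads the list, giving the generic matcher a mnemonic-led operand list. The returned index would let the matched MCInst's operands be restored afterwards.

#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

struct Operand {
  bool IsToken;
  std::string Text;
};

std::size_t canonicalizeOperands(std::vector<Operand> &Ops) {
  std::size_t Idx = 0;
  while (Idx < Ops.size() && !Ops[Idx].IsToken)
    ++Idx;                                 // first string token, if any
  if (Idx == 0 || Idx == Ops.size())
    return Idx;                            // already mnemonic-led, or none
  Ops.push_back(Operand{true, "<dummy>"}); // marks where the head was cut
  // [head..., tok, tail...] + dummy  ->  [tok, tail..., dummy, head...]
  std::rotate(Ops.begin(), Ops.begin() + Idx, Ops.end());
  return Idx;
}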

For reference, before r256669 the binary size of all LLVM tools in release mode with a MinGW build was 78,331,196 bytes, and the size with the patch is 77,467,273, a difference of approximately 1%. Is the binary size difference of the tools what spawned the investigation?

Does the list have opinions on the severity of this type of issue?

I didn't initially notice the binary size. I opened X86GenAsmMatcher.inc and noticed the new classes in the table while looking at something else. This led me to investigate the history of the change, which is when I found out it was not reviewed, and that it increased the table size on all targets but Hexagon with no benefit to those targets.

Further review indicated that this appears to be a suboptimal algorithm implementation for the Hexagon target anyway, due to the linear scan. Do you have any large Hexagon asm files with which you could measure the parsing performance of this linear scan vs. the previous operand-reordering hack you mentioned?

I hadn't run a benchmark before, but I made a file with 1 million wait(r0) instructions; this is the last entry in the table, so it should trigger the worst case. The file amounts to 10 MB and assembles to an object around 4 MB. For comparison I made an x86 file with 1 million "mov eax, cs" instructions.

New parser:

colinl@COLINL1 ~/llvm-master.ninja
$ time ./bin/llvm-mc.exe -triple=hexagon-unknown-elf -filetype=obj huge.s -o huge.o

real 0m6.627s
user 0m0.000s
sys 0m0.031s

Old parser:

colinl@COLINL1 ~/BRANCH_7_2.ninja
$ time ./bin/llvm-mc.exe -triple=hexagon-unknown-elf -filetype=obj ../llvm-master.ninja/huge.s -o huge.o

real 0m8.421s
user 0m0.016s
sys 0m0.015s

X86 parser:

colinl@COLINL1 ~/llvm-trunk.ninja
$ time ./bin/llvm-mc.exe -triple=i386 -x86-asm-syntax=intel -filetype=obj -o huge.o huge.s

real 0m1.989s
user 0m0.015s
sys 0m0.015s

The current parser exceeds the speed of our old parser, and in the worst case Hexagon parsing seems to be 3x slower than mnemonic-led targets, i.e. all other targets.

Is that the full instruction "wait(r0)"? Wouldn't "wait" be considered a mnemonic by the Mnemonic.empty() check in the match function, which would send it through the binary search?

Hah, you're right, good catch. I changed the test to "r0=mpy(r0.l, r0.l):<<1:sat" and it definitely went suboptimal.

colinl@COLINL1 ~/llvm-master.ninja
$ time ./bin/llvm-mc.exe -triple=hexagon-unknown-elf -filetype=obj huge.s -o huge.o

real 8m8.593s
user 0m0.000s
sys 0m0.031s

colinl@COLINL1 ~/BRANCH_7_2.ninja
$ time ./bin/llvm-mc.exe -triple=hexagon-unknown-elf -filetype=obj huge.s -o huge.o

real 0m32.186s
user 0m0.015s
sys 0m0.015s

I think the easier part of getting the mnemonic-searching technique working is in AsmParser, where we're presented with the actual operands and could search through them with a target-specific hook and present the parser with a mnemonic in a canonical way.
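A hedged sketch of what such a hook could look like (the class and method names are invented for illustration): the target scans the parsed operands and reports which one should be presented to the matcher as the mnemonic.

#include <cstddef>
#include <string>
#include <vector>

struct ParsedOperand {
  bool IsToken;
  std::string Text;
};

class TargetMnemonicLocatorSketch {
public:
  virtual ~TargetMnemonicLocatorSketch() = default;
  // Default preserves the mnemonic-first behavior of existing targets.
  virtual std::size_t
  findMnemonicOperand(const std::vector<ParsedOperand> &) const {
    return 0;
  }
};

// A hypothetical Hexagon override: skip leading predicates like "if (p0)"
// and pick the first token that names an operation.
class HexagonMnemonicLocatorSketch : public TargetMnemonicLocatorSketch {
public:
  std::size_t
  findMnemonicOperand(const std::vector<ParsedOperand> &Ops) const override {
    for (std::size_t I = 0, E = Ops.size(); I != E; ++I)
      if (Ops[I].IsToken && Ops[I].Text != "if")
        return I;
    return 0;
  }
};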

I think the harder part is getting tablegen to produce a match table that comes up with the same mnemonic given what's parsed from the .td files. In our internal version we placed this code directly in the target-independent code in tablegen because we were only enabling the Hexagon target; this would obviously be unacceptable. It seems like tablegen in AsmMatcherEmitter.cpp would need a way to execute target-specific code on each parsed instruction text to retrieve what the mnemonic should be.

I'm not aware of a place where TableGen does this type of thing already to use as a reference, any ideas?

Any idea why Hexagon was so much slower than x86 even when it was using a binary search for "wait(r0)"?

I ran a profile on it, and it seems like the extra time comes from two things we do called checking and shuffling. Checking verifies that the sub-instructions in the VLIW instruction satisfy architecture requirements. Shuffling allows the assembly writer to write the instructions in a bundle in any order, which the assembler then orders correctly in binary form.

Is the runtime complexity of those 2 things particularly bad? Why do they amount to so much of the time?

The cost is higher than simply matching, though it could be improved from where it is right now. One of the things we're focusing on is making sure the codebase is in sync with the community; we want to work on performance improvements, but if we're not in sync our efforts will be wasted.

From what I can see, the closest idea we have for improving the parsing performance without a major rework is to figure out how to get tablegen to pick a mnemonic in a target-dependent way instead of simply using the first operand of the instruction. Right now tablegen doesn't link with any target code; I don't see a path to do this cleanly.

Having the Hexagon asm parser be slow in some cases isn't desirable, though I think it's less desirable to have no parser at all. I ran a benchmark against a more real-world module instead of the worst-case synthetic module: a 450 MB assembly file of ~11 million lines took 96 seconds. Maybe it can be improved, but the degenerate case seems not to be the usual case.

It would be nice if there were at least one other target that needed this type of change to justify more of a target-independent revamp, but as it stands it would be a large amount of rework to fix what is mostly an edge case on one target. We want to make sure none of the Hexagon stuff will negatively impact the target-independent code, and from what I can see, with the change you put in to remove the mnemonic entry from the match table, the target-independent impact has been addressed.

Do we think there are outstanding issues that justify following through on reverting?
Do we think the suboptimal performance of Hexagon asm parsing on synthetic tests justifies a larger target-independent rework prior to accepting the parser?

In my view, the Hexagon asm parsing performance is adequate for the moment.