This is an archive of the discontinued LLVM Phabricator instance.

[globalisel][tablegen] Partially fix compile-time regressions by converting matcher to state-machine(s)
ClosedPublic

Authored by dsanders on Jun 1 2017, 12:06 AM.

Details

Summary

Replace the matcher if-statements for each rule with a state-machine. This
significantly reduces compile time, memory allocations, and cumulative memory
allocation when compiling AArch64InstructionSelector.cpp.o after r303259 is
recommitted.

The following patches will expand on this further to fully fix the regressions.

Event Timeline

dsanders created this revision.Jun 1 2017, 12:06 AM

Replace the matcher if-statements for each rule with a state-machine. This
significantly reduces compile time, memory allocations, and cumulative memory
allocation when compiling AArch64InstructionSelector.cpp.o after r303259 is
recommitted.

My guess is that the compiler compile time goes down simply because tablegen produces far less code?
Do you also have an insight of how using a state machine "interpreter" changes the run-time of the compiler itself compared to using the "directly-compiled" generated C++ code?
I can't predict really how the different tradeoffs in both approaches will affect the compiler's run-time performance in the end, so it would be nice to share insights you may have for this.

Also, if we end up using a state machine for this, I think it'd be a good idea to, after the design settles, write a blog post or something similar to explain how the state machine works, as I've heard from many people that reverse engineering the DAGISel-produced state machine was hard due to limited or no documentation on it.

Replace the matcher if-statements for each rule with a state-machine. This
significantly reduces compile time, memory allocations, and cumulative memory
allocation when compiling AArch64InstructionSelector.cpp.o after r303259 is
recommitted.

My guess is that the compiler compile time goes down simply because tablegen produces far less code?

That's right. Instead of generating C++ and having the compiler optimize a single massive function, it's compiling a single copy of each predicate and chaining them together with a table.

Do you also have an insight of how using a state machine "interpreter" changes the run-time of the compiler itself compared to using the "directly-compiled" generated C++ code?
I can't predict really how the different tradeoffs in both approaches will affect the compiler's run-time performance in the end, so it would be nice to share insights you may have for this.

I haven't measured it yet since the priority so far has been to get the compile times back down to a reasonable level. Once I've finished posting the patches for review I'll take a look at the run-time effect.

My intuition is that the compiler was unlikely to have optimized the previous implementation very well (particularly since it didn't inline the single-use lambdas) and assuming that's the case, the overhead of the interpreter is likely to be limited to the cost of the table lookups. It's possible that it will turn out to be faster since the previous implementation had no chance of fitting into an instruction cache whereas the new code may fit on some machines. On machines where it doesn't fit, some paths are more common than others so it may still be able to use the instruction cache better than the old implementation did.

Also, if we end up using a state machine for this, I think it'd be a good idea to, after the design settles, write a blog post or something similar to explain how the state machine works, as I've heard from many people that reverse engineering the DAGISel-produced state machine was hard due to limited or no documentation on it.

Ok.

Replace the matcher if-statements for each rule with a state-machine. This
significantly reduces compile time, memory allocations, and cumulative memory
allocation when compiling AArch64InstructionSelector.cpp.o after r303259 is
recommitted.

My guess is that the compiler compile time goes down simply because tablegen produces far less code?

That's right. Instead of generating C++ and having the compiler optimize a single massive function, it's compiling a single copy of each predicate and chaining them together with a table.

Do you also have an insight of how using a state machine "interpreter" changes the run-time of the compiler itself compared to using the "directly-compiled" generated C++ code?
I can't predict really how the different tradeoffs in both approaches will affect the compiler's run-time performance in the end, so it would be nice to share insights you may have for this.

I haven't measured it yet since the priority so far has been to get the compile times back down to a reasonable level. Once I've finished posting the patches for review I'll take a look at the run-time effect.

My intuition is that the compiler was unlikely to have optimized the previous implementation very well (particularly since it didn't inline the single-use lambdas) and assuming that's the case, the overhead of the interpreter is likely to be limited to the cost of the table lookups. It's possible that it will turn out to be faster since the previous implementation had no chance of fitting into an instruction cache whereas the new code may fit on some machines. On machines where it doesn't fit, some paths are more common than others so it may still be able to use the instruction cache better than the old implementation did.

Also, if we end up using a state machine for this, I think it'd be a good idea to, after the design settles, write a blog post or something similar to explain how the state machine works, as I've heard from many people that reverse engineering the DAGISel-produced state machine was hard due to limited or no documentation on it.

Ok.

Thanks, that all makes sense.
I'm afraid I'll have to skip the detailed review of this and leave that to someone more up to speed with design of this part of GlobalISel.

ab added inline comments.Jun 16 2017, 10:37 AM
include/llvm/CodeGen/GlobalISel/InstructionSelector.h
65–73

Are all three opcodes necessary? It seems like we only need one, that defines an ID from an operand.

include/llvm/CodeGen/GlobalISel/InstructionSelectorImpl.h
31

Should this be unsigned?

Also, is int64_t necessary? Seems like almost everything should fit in 16 bits. X86 is close to 15k opcodes, but we can complain when we emit something that doesn't fit. As for CheckInt, we can just encode the 64-bit value across 4 operands.

test/TableGen/GlobalISelEmitter.td
88

Add CHECK lines for ComplexPredicates, TypeObjects, Renderers? It's not crucial for testing, but it's helpful for reading.

utils/TableGen/GlobalISelEmitter.cpp
2082–2091

Instead of a hardcoded list, MVTToLLT whatever can be converted in MVT::all_valuetypes() ?

dsanders added inline comments.Jun 20 2017, 6:29 AM
include/llvm/CodeGen/GlobalISel/InstructionSelector.h
65–73

I think there's a potential argument for merging GIM_RecordInsn and GIM_PushInsnOpDef. They're currently separate because I wasn't completely convinced that all visited instructions needed to be recorded. I'm leaning towards all of them being recorded but I decided to come back to that once the compile-time regression was resolved.

GIM_PopInsn needs to be kept separate for the current DAG walking code. Without it we wouldn't be able to walk something like:

(G_MUL (G_ADD a, b), (G_ADD c, d))

since once you descend into either G_ADD you wouldn't be able to visit the other. That said, I agree that tweaking GIM_PushInsnOpDef so that it takes a source instruction ID would achieve the same effect.

Are you ok with me working on this change as an optimisation after the compile-time regression fixes are committed? Currently all my work is bottlenecked on that.

include/llvm/CodeGen/GlobalISel/InstructionSelectorImpl.h
31

Should this be unsigned?

It needs to be signed at the moment but the reason isn't a particularly good one: GIM_CheckInt takes a signed integer immediate straight from the table.

Also, is int64_t necessary?

For most things no, but it's necessary for GIM_CheckInt at the moment. Narrowing the type is a size optimization that I decided to leave for after the compile-time issue was resolved since all the tablegen work is bottlenecked on it.

utils/TableGen/GlobalISelEmitter.cpp
2082–2091

This is another size optimization I left for later. I'd like this table to be created based on the calls to LLTCodeGen's constructor. This will make the table much smaller for most targets since many of these types are only needed on X86.

rovka edited edge metadata.Jun 22 2017, 5:41 AM

Not for this patch, but I feel like some of the names and comments (e.g. emitCxxCaptureStmts) are starting to drift a bit from the implementation. Maybe at the end of this patch stack you could look into brushing them up?

include/llvm/CodeGen/GlobalISel/InstructionSelector.h
65–73

That looks like a good test case to add :)

I'm still not convinced that you need a stack though, can't you just record all instructions (as you said you are inclined to do anyway) and express everything in terms of InsnIDs?

168

Maybe put the Renderers, TypeObjects etc into a struct, so we don't have to change this signature every time we want to add or modify a state.

If not, at least consider using ArrayRef's for them.

test/TableGen/GlobalISelEmitter.td
141

Nitpick: "RegClassID, " (just so it matches the others).

utils/TableGen/GlobalISelEmitter.cpp
256–262

Please add a comment here explaining how this should be used and why it's different from defineInsnVar.

267

This doesn't seem to be caused by this patch, but the indentation here and below looks funny.

2018

Um... return A->getName() < B->getName() ?

2136

I would add a static helper for generating these names, so it's easier to keep in sync with the use in RuleMatcher::emit.

vitalybuka added a subscriber: vitalybuka.
ab accepted this revision.Jun 27 2017, 11:48 AM

With the non-optimization comments addressed, I'm OK with this.

This revision is now accepted and ready to land.Jun 27 2017, 11:48 AM
dsanders updated this revision to Diff 104652.Jun 29 2017, 7:55 AM
dsanders marked 9 inline comments as done.

Merged GIM_RecordInsn, GIM_PushInsnOpDef, GIM_PopInsn to a single kind of node: GIM_RecordInsn
Bundled the matcher state into one object. AvailableFeatures isn't part of this because it doesn't quite fit.
Bundled the helper tables into one object.

Fixed several nits.

TODO: Rebase to ToT.

dsanders added inline comments.Jun 30 2017, 3:38 AM
include/llvm/CodeGen/GlobalISel/InstructionSelector.h
65–73

I've merged the three opcodes into a single one and added a test like the one above (I used G_SUB on all three nodes to avoid commutativity).

There's a small optimization that could be applied to the table as a result of this but it's a job for another patch. If one rule causes us to record the instruction that def's MIs[1]->getOperand(2), then no subsequent rule needs to do it again.

168

I've made the following changes to this:

  • Removed MIStack as part of another change.
  • Combined MIs and Renderers into a mutable State object
  • Combined TypeObjects, FeatureBitsets, ComplexPredicates into a constant MatcherInfo object.

AvailableFeatures is still separate because I don't think it quite fits into either group.

test/TableGen/GlobalISelEmitter.td
88

I've added checks to the start of the file for them.

141

Well spotted.

utils/TableGen/GlobalISelEmitter.cpp
256–262

It was supposed to be different but the implementations in this patch ended up the same. I must have forgotten to make the change I intended.

I've corrected the implementations and added a comment.

267

I'm not sure what happened here but I've fixed it in this patch.

2018

It's a lexicographic comparison rather than a pointer comparison. The StringRef returned by Record::getName() provides relational operators that compare the strings (similar to std::string)

dsanders updated this revision to Diff 105104.Jul 3 2017, 9:58 AM

Rebase before commit

I don't seem to be able to commit right now:

svn: E200029: Commit failed (details follow):
svn: E200029: Couldn't perform atomic initialization

I'll try again later.

dsanders closed this revision.Jul 4 2017, 7:35 AM