This is an archive of the discontinued LLVM Phabricator instance.

[WIP][GlobalISel][TableGen] Optimize MatchTable for faster instruction selection
ClosedPublic

Authored by qcolombet on Oct 17 2017, 7:47 PM.

Details

Summary

Hi,

This is a WIP, if people like it, I can add comments and stuff and push behind an experimental option (or just use it by default). This is the full diff, in particular, I've squashed the whole history to show the full picture.

  • Context ***

Right now, the table generated for matching instruction is straight forward but highly inefficient. Basically, each pattern generates its own set of self contained checks and actions.
E.g., TableGen generates:
First pattern
CheckNumOperand 3
CheckOpcode G_ADD
...
Build ADDrr
Second pattern
CheckNumOperand 3
CheckOpcode G_ADD
...
Build ADDri
// Third pattern
CheckNumOperand 3
CheckOpcode G_SUB
...
Build SUBrr

  • Problem ***

Because of that generation, a *lot* of check are redundant between each pattern and are checked every single time until we reach the pattern that matches.
E.g., Taking the previous table, let say we are matching a G_SUB, that means we are going to check all the rules for G_ADD before looking at the G_SUB rule. In particular we are going to do:
check 3 operands; PASS
check G_ADD; FAIL
; Next rule
check 3 operands; PASS (but we already knew that!)
check G_ADD; FAIL (well it is still not true
; Next rule
check 3 operands; PASS (really!!)
check G_SUB; PASS (at last :P)

  • Proposed Solution ***

This patch introduced a concept of group of rules (GroupMatcher) that share some predicates and only get checked once for the whole group.

As a proof of concept this patch only creates groups with one nesting level, but really there is nothing preventing it to do more.

In particular, for the given example it would generate:
// First group
CheckOpcode G_ADD

// First pattern
CheckNumOperand 3
...
Build ADDrr
// Second pattern
CheckNumOperand 3
...
Build ADDri

// Second group
CheckOpcode G_SUB

// Third pattern
CheckNumOperand 3
...
Build SUBrr

But it could also create a sub group for the checknumoperand 3 (we would just need to add a OptimizeRules on the rules within a group).

  • Result ***

With only one level of nesting, the instruction selection pass was up to 4x faster. For instance, one instruction takes 500 checks, instead of 24k! With more nesting we could get in the tens I believe.

What do people think?

Cheers,
-Quentin

Diff Detail

Repository
rL LLVM

Event Timeline

qcolombet created this revision.Oct 17 2017, 7:47 PM

For the record, what the grouping does is pretty much flattening the predicate class hierarchy. Long term, we may want to have a simpler class hierarchy.
Also, I have tried to same on memory comsumption and speed of tablegen, in particular, we duplicate operand indices and instruction identifier all over the place to make the checking of predicate correct. (isIdentical)

dsanders edited edge metadata.Oct 24 2017, 12:34 PM

There are a lot of changes in this patch so I haven't gone through the detail but this seems to be heading in the general direction I had in mind for when we got to optimizing the rules.

After importing and sorting the ruleset into priority order. I was thinking of having a collection of partitioners which would subdivide the ruleset into groups without changing that priority order. So for example, assuming the priority was this (it wouldn't be but ignore that for now):

(G_FNEG (G_FADD x, y))
(G_FADD x, y)
(G_FSUB x, y)
(G_FNEG x)

one partitioner would partition based on InstructionMatcher::Operands::size() like so:

(G_FNEG (G_FADD x, y))
---
(G_FADD x, y)
(G_FSUB x, y)
---
(G_FNEG x)

Even though the first and last rules only have one operand at the top level, they aren't grouped because of the priority of the G_FADD/G_FSUB in between. This way, we can ensure priority is being respected. The partitioner would then remove the relevant check from the rules and move them to the partition boundaries with appropriate GIM_Try's, much like you're doing in this patch. Each sub-ruleset would then be further partitioned forming a decision tree. Finally, when we run out of partitioners that produce >=2 partitions we generate the remaining rules in the ruleset as we do now.

Choosing which partitioner to apply at any given time is somewhat tricky. For this, I was thinking there would be two kinds of partitioner. Those that deal with structure (number of operands, opcodes?*, nested instructions, etc.) and those that deal with predicates. The structural ones would apply first, leaving the resulting sub-rulesets to test predicates using common instruction-id's and operand indices. For example:

(G_FADD x, simm16)
(G_FADD x, y)
(G_FNEG (G_FADD x, y))
(G_FNEG x)

would be divided by the structural partitioners like so

GIM_Try, ...
	GIM_CheckNumOperands, Insn0, 3,
	GIM_Try, ...
		GIM_CheckOpcode, Insn0, G_FADD,
		# (G_FADD x, simm16)
		# (G_FADD x, y)
GIM_Try, ...
	GIM_CheckNumOperands, Insn0, 2,
	GIM_RecordInsn, Insn1, Insn0, 1,
	GIM_Try, ...
		GIM_CheckOpcode, Insn0, G_FNEG,
		GIM_CheckOpcode, Insn1, G_FADD,
		# (G_FNEG (G_FADD x, y))
	GIM_Try, ...
		GIM_CheckNumOperands, Insn0, 2,
		# (G_FNEG x)

The G_FADD cases:

(G_FADD x, simm16)
(G_FADD x, y)

would then be further sub-divided by the predicate-based partitioners like so:

GIM_Try, ...
	GIM_CheckNumOperands, Insn0, 3,
	GIM_Try, ...
		GIM_CheckOpcode, Insn0, G_FADD,
		GIM_Try, ...
			GIM_CheckPredicate, Insn0, 2, simm16
			# (G_FADD x, simm16)
		GIM_Try, ...
			# (G_FADD x, y)
GIM_Try, ...
	GIM_CheckNumOperands, Insn0, 2,
	GIM_RecordInsn, Insn1, Insn0, 1,  // MIs[1] = getVRegDef(MIs[0].getOperand(1))
	GIM_Try, ...
		GIM_CheckOpcode, Insn0, G_FNEG,
		GIM_CheckOpcode, Insn1, G_FADD,
		# (G_FNEG (G_FADD x, y))
	GIM_Try, ...
		GIM_CheckNumOperands, Insn0, 2,
		# (G_FNEG x)

Within each set of partitioners, we'd need some heuristics to select an optimal one (and prevent the tree builder from getting too expensive). This is another reason for separating structural partitioners from predicate-based partitioners. The structural set have a more-or-less fixed order they can be applied (because they directly walk the tree) while still being near-optimal, then by the time we're onto the predicate-based ones (which have more freedom) we should have a small enough set to be a bit more thorough and evaluate which one produces the better tree. Somewhere in here, I also wanted to factor in the relative frequency of instructions so that rare instructions are checked last and frequent instructions require fewer checks. I was thinking this could be done as a tie-breaker in isHigherPriorityThan().

One thing the structual partitioners will want in order to maximize sharing of the structure predicates is a predictable numbering of instructions. A simple numbering scheme wouldn't work for variadic instructions so I think it's probably best to build the tree first and number the instructions afterwards. This is the reason instruction ID's were assigned during the emit*() functions and not stored in the matcher objects themselves.

*It's a bit debatable whether the opcode is structural or not. Technically you don't need it for structure but it seems wasteful to walk the def-use chain if the opcode is already going to prevent the match. On the other hand, if it's structural then (G_ADD (G_MUL x, y), z) and (G_SUB (G_MUL x, y), z) have different structure. It may be worth having a GIM_CheckOpcodeIsInSet or similar to balance the two extremes.

utils/TableGen/GlobalISelEmitter.cpp
594–597 ↗(On Diff #119422)

I don't understand why this is needed.

3053–3062 ↗(On Diff #119422)

This bit is correctness rather than optimization

For this, I was thinking there would be two kinds of partitioner. Those that deal with structure (number of operands, opcodes?*, nested instructions, etc.) and those that deal with predicates.

I would rather avoid the partitioners to be kind of "semantic-aware". If at all possible I would rather stick to simple concept like put "actions" that are identical together. On a different topic, the same apply for the matcher. Right now, we do the distinction between capturing, matching and so on, but all in all, this is just actions.

What I am saying is the design is more complicated than I would have thought and I wonder if on the long run it won't get in the way of modifying it... Unless it is well documented, in particular the intent :).

utils/TableGen/GlobalISelEmitter.cpp
594–597 ↗(On Diff #119422)

Admittedly, this is a hack around the fact that capturing a variable is not an action, but instead something that happen somewhat magically when we need to access something (defineInsnVar).

Now, how does this work:

  1. The "predicates" are self contained now, i.e., they reference all the InsnID and OpIdx they are going to access. In particular, they don't rely on a Rule at emission time to get this information (though the information needs to be in sync for this to work, thus the bunch of asserts added) Thanks to that, we can check that two predicates are really identical (as opposed to checking the same thing, but on different variables).
  2. When building the predicates we implicitly defines all the variables that are going to be accessed in Rule.
  3. We clean the implicit def in Rule (this method)
  4. When we emit Rule we check that the defined variables are equal to what we chose at build time for the predicate

If we skip #3, Rule is going to keep increasing the InsnVarID and the numbers we came up at build time won't reference the right thing.

Does it make sense? (look line 2842 for the call to that method)

3053–3062 ↗(On Diff #119422)

You're right, this is supposed to happen before the call to OptimizeRules.
Looks like I screwed up my squash :P.
I'll update the patch.

One thing to note: When you try a group, there is no way out. I.e., the assumption is that groups are mutually independent. This is actually why I haven't push for more nesting level, because the patch doesn't provide a way to check that this is true.

qcolombet updated this revision to Diff 120145.Oct 24 2017, 4:25 PM

Move back the sorting of the Rules in main function.
This is not part of the optimization process.

qcolombet marked an inline comment as done.Oct 24 2017, 4:27 PM

For this, I was thinking there would be two kinds of partitioner. Those that deal with structure (number of operands, opcodes?*, nested instructions, etc.) and those that deal with predicates.

I would rather avoid the partitioners to be kind of "semantic-aware". If at all possible I would rather stick to simple concept like put "actions" that are identical together. On a different topic, the same apply for the matcher. Right now, we do the distinction between capturing, matching and so on, but all in all, this is just actions.

What I am saying is the design is more complicated than I would have thought and I wonder if on the long run it won't get in the way of modifying it... Unless it is well documented, in particular the intent :).

I'm also trying to avoid semantic awareness but a certain amount of it is necessary on the structure side of things since we need to walk the tree (occasionally it's a DAG but predicates deal with that). For example, you need to record an instruction with GIM_RecordInsn before you can test predicates on the instruction or its operands. The current separation between capturing and everything else is intended to keep the table optimization simple since the artificial barrier avoids the need to worry about putting an opcode check before the instruction has been recorded.

On the structure side, there's limited scope for re-ordering in any case due to the need to walk the tree. However, at this level it's also easy to group rules that share the same structure but have different predicates. For example:

(G_FADD (G_FMUL a, b), c) // 1
(G_FSUB (G_FMUL a, b), c)  // 2
(G_ADD a, imm16) // 3
(G_SUB a, imm16) // 4
(G_MUL a, imm16) // 5
(G_ADD a, b) // 6
(G_SUB a, b) // 7
(G_MUL a, b) // 8
(G_FADD a, b) // 9
...

only has three distinct structures:

(X (Y a, b), c) // 1 and 2
(X c, (Y a, b)) // 1 (commutative G_FADD)
(X a, b) // 3 (including commutative), 4, 5 (including commutative), 6 (including commutative), 7, 8 (including commutative), and 9 (including commutative)

I'd expect the main structure partitioner to simply group rules with the same shapes together. This should be a big win optimization-wise since there's a small number of distinct shapes in the ruleset. Once you're past structure matching and have captured all the relevant instructions, there's no relevant semantics left to preserve and you're free to re-order the predicates as much as you like.

The artificial boundary between the semantic-aware structure and the semantic-ignorant predicates does come at a cost though. We limit the potential optimization slightly since we can't rule particular structures out based on predicates. I suspect we may eventually want to allow some predicates (e.g. a check for whether an opcode is in a particular set) to cross that boundary but doing so requires more semantic awareness.

As a side note, I'd eventually like to be able to have a GIM_BuildMI/GIM_MutateOpcode that can directly map input opcodes to output opcodes so that multiple rules can share the same actions. Maybe something like:

GIM_CheckOpcodeInSet ... G_ADD, G_SUB, 0
GIR_BuildMIWithMap, ..., G_ADD, ADDWrr, G_SUB, SUBWrr, 0,
... other actions ...

but I haven't given that much thought yet. The reason I'd like it is that there are an awful lot of near-identical binary rules (e.g. (G_ADD a, b)) that differ only by the opcodes involved.

One thing to note: When you try a group, there is no way out. I.e., the assumption is that groups are mutually independent. This is actually why I haven't push for more nesting level, because the patch doesn't provide a way to check that this is true.

That was true when I first introduced GIM_Try but it's not the case anymore. GIM_Reject is equivalent to a failed predicate and will resume processing from the label specified by the GIM_Try. If there is no active GIM_Try then it will give up matching entirely.

It's handled by these lines in InstructionSelectorImpl.h:

if (handleReject() == RejectAndGiveUp)
  return false;

If there is no active GIM_Try then it will give up matching entirely.

My bad, you're totally right. That means that I could theoretically already push for more than one level of nesting in the matching table without worrying about rejecting rules that would match later.
That's good to know!

Do you have any other comments on the code itself or should I push my changes? (With clean-ups like more comments and tests!)

dsanders added inline comments.Dec 1 2017, 5:27 PM
utils/TableGen/GlobalISelEmitter.cpp
533 ↗(On Diff #120145)

This function is a bit confusing. Going by it's name, I think it's trying to detect whether the Predicate is already in the Conditions array but the code only checks the last condition (which is enough given how we generate the table)

700 ↗(On Diff #120145)

predicates().begin() could be predicate_begin() which is more direct. Also this will need to be guarded against release builds otherwise some of the bots will object to the unused variable

719–731 ↗(On Diff #120145)

What should be the relative priority of the IPM_* and OPM_* predicates. They act on different things so I don't think there's a meaningful one but I also don't think we should have an artificial one.

2338 ↗(On Diff #120145)

I think we should have InstructionMatcher::addPredicate forward its ID onto the predicates that are added to it

2390 ↗(On Diff #120145)

Similarly, OperandMatcher::addPredicate should forward the Instruction ID and Operand Index to the predicates added to it

3052 ↗(On Diff #120145)

OptimizeRules -> optimizeRules

Also, I think this code will break if it ever manages to hoist a predicate that uses InsnID>0 or any OpIdx. There's currently no code to ensure an appropriate GIM_RecordInsn or GIM_CheckNumOperands has occurred before the hoisted predicate. At the moment, we're only managing to hoist the GIM_CheckOpcode for InsnID=0 which is always safe so we're getting away with it for now.

Here's roughly how I think we should go about fixing this:

  • We should prepare the way for GroupMatcher by changing the following:
    • RuleMatcher should have a PredicateListMatcher
    • Feature bit tests should be handled by a PredicateMatcher in the RuleMatcher
    • InstructionMatcher and OperandMatcher should not have a PredicateListMatcher. Everything that was there should move to the RuleMatcher except for InstructionOperandMatcher's which should become a simple member of the OperandMatcher
  • Then we should introduce the GroupMatcher which should have an array of InstructionMatchers (just like RuleMatcher) to define the structure, and a PredicateListMatcher to define the conditions. We should distribute rules into groups based on the shape of the InstructionMatcher/OperandMatcher structure while preserving priority order. After grouping them, RuleMatcher should not hold any representation of structure.
  • Finally, for each nested GroupMatcher we should do what this code currently does

The overall effect will be that the top-level GIM_Try's will define the structure and emit the relevant GIM_RecordInsn's and GIM_CheckNumOperands. Deeper levels of GIM_Try's will only test conditions. Later on, we can then start breaking up the structure level into finer grain groupings and/or improve the intelligence of the grouping of the condition levels.

That said, it would be good to make some progress towards saner tables quickly. We could just adapt this code to intentionally hoist only the opcode for InsnID == 0 and improve things from there.

3054 ↗(On Diff #120145)

StorageGroupMacther -> StorageGroupMatcher

3365–3367 ↗(On Diff #120145)

This doesn't look right. It's removing the whole instruction matcher if its first operand matcher runs out of predicates. There may be other operand matchers left which do have predicates.

594–597 ↗(On Diff #119422)

If we're going to put InsnID/OpIdx in the matchers, why not use those and skip step 3 and 4?

I'll upload a new version on Monday.

utils/TableGen/GlobalISelEmitter.cpp
533 ↗(On Diff #120145)

Yeah, the name is not great (suggestion welcome :)), but the code does exactly what I wanted to. The incentive was to keep the building of groups cheap and the idea was if two rules don't have the same order in their predicates, they are not going to be grouped together. Although this is a weakness of the current grouping (in other words, it won't group thing that may have been commuted between two rules), the assumption is that the predicates are sorted in the same relative order and thus it should be that relevant.

> that's intentional :)

700 ↗(On Diff #120145)

Both good catches!

719–731 ↗(On Diff #120145)

I mainly followed what I believe was the layout of the table: Instruction then Operand. (General to Special.)

2338 ↗(On Diff #120145)

Good point!
I made the change mechanically, I didn't notice the pattern.

2390 ↗(On Diff #120145)

Yep!

3052 ↗(On Diff #120145)

Given the record "directives" are not explicitly part of the predicate, I don't think we can mess them up. Indeed, we won't be able to hoist or reorder (which we don't do anyway) them, thus we cannot end up with a non-recorded use, right?

In other words, I don't think we need to restrict ourselves to InsnID == 0.
Am I missing something?

3052 ↗(On Diff #120145)

The plan makes sense to me. It matches what I have in mind when I said we should flat the hierarchy.

3054 ↗(On Diff #120145)

Good catch.

3365–3367 ↗(On Diff #120145)

Good catch, this check is supposed to check that OpMatcher was the last one!

594–597 ↗(On Diff #119422)

That's the long term plan, but it involves a bigger refactoring and I was not willing to pull that into this patch.
Is that fine with you?

(Admittedly, I was not planning to do that refactoring any time soon!)

fhahn added a subscriber: fhahn.Dec 3 2017, 2:49 PM
dsanders added inline comments.Dec 4 2017, 10:17 AM
utils/TableGen/GlobalISelEmitter.cpp
533 ↗(On Diff #120145)

lastConditionMatches()? Alternatively, the caller could use Predicate.isIdenticalTo(group.conditions_back())

719–731 ↗(On Diff #120145)

I think we'll get better tables if opcodes are allowed to have low priority. For example, binary operations can check for reg <op> reg => reg, and reg <op> imm => reg before testing the opcode.

That said, looking at it again it's not a problem since PredicateMatcher doesn't define a isHigherPriorityThan(). They're in OperandPredicateMatcher and InstructionPredicateMatcher so the two families are never compared.

Could you add to the comment saying the relative priority of IPM_* and OPM_* doesn't matter?

3052 ↗(On Diff #120145)

Suppose you've hoisted a couple checks and have some rules in a group of the form:

GIM_Try,
  GIM_CheckOpcode, /*InsnID*/0, G_ADD
  GIM_Try
    GIM_RecordInsn, /*InsnID*/1, ...
    GIM_CheckOpcode, /*InsnID*/1, G_MUL
    ...
  GIM_Try
    GIM_RecordInsn, /*InsnID*/1, ...
    GIM_CheckOpcode, /*InsnID*/1, G_MUL
    ...

Both rules have the same leading check, so that check is also hoisted by this code to get:

GIM_Try,
  GIM_CheckOpcode, /*InsnID*/0, G_ADD
  GIM_CheckOpcode, /*InsnID*/1, G_MUL    // <-- InsnID 1 has not been defined by a GIM_RecordInsn yet
  GIM_Try
    GIM_RecordInsn, /*InsnID*/1, ... // <-- Was emitted by emitCaptureOpcodes()
      ...
  GIM_Try
    GIM_RecordInsn, /*InsnID*/1, ...
    ...

At the moment the only thing that seems to prevent this is that all the rules differ in the second predicate so it doesn't check whether the GIM_CheckOpcode for InsnID 1 can be hoisted.

594–597 ↗(On Diff #119422)

That makes sense to me but we should follow up with the refactor asap. In the mean-time, we should comment on the requirement that the numbering comes out the same and explain the plan to fix it.

qcolombet added inline comments.Dec 4 2017, 4:00 PM
utils/TableGen/GlobalISelEmitter.cpp
3052 ↗(On Diff #120145)

Good catch. I mis-remembered that record insn stuff was also emitted as part of the emitPredicates thing.
Do you think I should add a emitCaptureOpcodes as part of the GroupMatcher as well?

Anyhow, for now, I will just bail out on InsnID != 0 or use OpIdx.

qcolombet updated this revision to Diff 126205.Dec 8 2017, 1:55 PM
qcolombet marked 5 inline comments as done.
  • Address nitpicks
  • Push InsnID and OpIdx in addPredicate
  • Bail on sharing when InsnID != 0 or OpIdx is specified
  • Make NumOperands a member variable instead of computed from the size of the list
qcolombet marked an inline comment as done.Dec 8 2017, 1:56 PM
qcolombet added inline comments.
utils/TableGen/GlobalISelEmitter.cpp
3365–3367 ↗(On Diff #120145)

Actually, I've checked the implementation of InstructionMatcher::pop_front and it only does what I intended: removing the first operand.

Talking with Daniel offline there was something else to fix: getNumOperands, so that the related check remains correct (it was looking at the size of this list).

dsanders accepted this revision.Dec 11 2017, 2:07 PM

LGTM. We need to follow up with the refactor to eliminate clearImplicitMap() though. We ought to follow up with the simplification of PredicateListMatcher too. The templating isn’t very useful anymore now that predicates have a common base class

utils/TableGen/GlobalISelEmitter.cpp
756 ↗(On Diff #126205)

Some of this predates this patch but we seem to have three naming conventions for operations affecting Matchers:

  • insnmatchers_front
  • matchers_empty
  • pop_front

We should pick one, my preference is for the first two. Of those two I have a slight preference for insnmatchers_front

811 ↗(On Diff #126205)

This will crash if predicates_begin() == predicates_end()

920 ↗(On Diff #126205)

This seems to have undone the TiedTo -> MatchingName change I was asked to make when I committed SameOperandMatcher

3365–3367 ↗(On Diff #120145)

Sorry about that.

There’s also the operand index which uses the position within this list

594–597 ↗(On Diff #119422)

I dont think the comment has been added

This revision is now accepted and ready to land.Dec 11 2017, 2:07 PM
qcolombet added inline comments.Dec 11 2017, 2:47 PM
utils/TableGen/GlobalISelEmitter.cpp
756 ↗(On Diff #126205)

Good catch.

811 ↗(On Diff #126205)

We check this case in the previous statement, right?

920 ↗(On Diff #126205)

Bad merge when I rebased I suppose. Will fix.

3365–3367 ↗(On Diff #120145)

Okay, looking.

594–597 ↗(On Diff #119422)

I missed that one. Thanks.

dsanders added inline comments.Dec 11 2017, 2:58 PM
utils/TableGen/GlobalISelEmitter.cpp
811 ↗(On Diff #126205)

Scratch this one. We check it a little earlier, I just didn’t notice because it was off screen

qcolombet updated this revision to Diff 126869.Dec 13 2017, 4:33 PM
qcolombet marked 7 inline comments as done.
qcolombet added inline comments.
utils/TableGen/GlobalISelEmitter.cpp
3365–3367 ↗(On Diff #120145)

Seems like InstMatcher::getOperand would already do the right thing (it checks that getOperandIndex == OpIdx) and AFAICT the uses of OperandMatcher::getOperandIndex are fine.
What else should I check?

This revision was automatically updated to reflect the committed changes.
dsanders added inline comments.Dec 18 2017, 11:51 AM
utils/TableGen/GlobalISelEmitter.cpp
3365–3367 ↗(On Diff #120145)

Sorry, forgot to come back to this.

There's the emit*() functions but this patch already rewrites the code that might be affected so I think we're fine.