This is an archive of the discontinued LLVM Phabricator instance.

[RFC][GlobalISel] Replace the current GlobalISel matcher with a bottom-up matcher
Needs Revision · Public

Authored by Kai on Jan 6 2023, 6:51 AM.

Details

Summary

The GICombiner can match instruction patterns, but in the current implementation this feature is basically
not used. One reason is that the matching capability is somewhat limited. Let's look at some examples.

I can easily match a G_ICMP/G_ZEXT/G_SUB sequence:

(match (G_ICMP $icmp, $cc, $srcb, $srcc),
       (G_ZEXT $zext, $icmp),
       (G_SUB $dst, $srca, $zext))

Just change G_SUB to G_ADD:

(match (G_ICMP $icmp, $cc, $srcb, $srcc),
       (G_ZEXT $zext, $icmp),
       (G_ADD $dst, $srca, $zext))

and it does not work as expected, because the current implementation does not handle commutable instructions the way the C++ mi_match() does. To get the desired outcome, you have to add another combine rule, just with the $srca and $zext operands swapped. Obviously, just using C++ code seems easier.
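For illustration, the duplicated rule would have to spell out the swapped operand order explicitly:

(match (G_ICMP $icmp, $cc, $srcb, $srcc),
       (G_ZEXT $zext, $icmp),
       (G_ADD $dst, $zext, $srca))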

Even more annoying, turning to a simple tree pattern like

(match (G_AND $srca, $t1, $t2),
       (G_AND $srcb, $t3, $t4),
       (G_OR $dst, $srca, $srcb))

just crashes TableGen.

The proposal is to replace the current matcher implementation with a bottom-up matcher.

The basic idea of a bottom-up matcher is to associate with each instruction the set of matching patterns, called the match set. For an instruction without use operands (a leaf), the match set is easily determined. For any other instruction, the match set is retrieved via a table lookup, using the match sets of the use operands as table indices. As a result, all matching patterns can be found in linear time.
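As a rough sketch of the idea in C++ (the types and table names here, Insn, LeafMatchSet and Transition, are hypothetical placeholders, not the code this patch generates; it also assumes exactly two use operands for brevity):

#include <map>
#include <vector>

// Illustrative model of an instruction: an opcode plus the instructions
// defining its use operands.
struct Insn {
  unsigned Opcode;
  std::vector<const Insn *> UseDefs;
};

// Placeholder tables that would be emitted by the generator: the match set
// of each leaf opcode, and a per-opcode transition table indexed by the
// match sets of the use operands.
extern const std::vector<unsigned> LeafMatchSet;
extern const std::vector<std::vector<std::vector<unsigned>>> Transition;

unsigned matchSetOf(const Insn &I, std::map<const Insn *, unsigned> &State) {
  unsigned MS;
  if (I.UseDefs.empty()) {
    // A leaf: the match set is known statically.
    MS = LeafMatchSet[I.Opcode];
  } else {
    // An inner node: the operands were visited first (bottom-up), so their
    // match sets are already in State; one lookup yields this node's set.
    MS = Transition[I.Opcode][State[I.UseDefs[0]]][State[I.UseDefs[1]]];
  }
  State[&I] = MS;
  return MS; // a later lookup maps MS to the list of rules to run
}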

Commutable instructions are handled by adding the variant patterns that result from swapping the subpatterns. For the generated code, the implication is that only the variable bindings vary; the rule code is not duplicated.

This implementation is based on the algorithm from Chase. See https://dl.acm.org/doi/10.1145/41625.41640.
It is a drop-in replacement for the existing matcher, with no changes in the result.

Some numbers from Linux on Z (orig vs change applied):
Compiling LLVM takes slightly longer with this change applied: 25m35.471s vs 28m51.797s
The resulting llc binary is slightly smaller: 145.899.472 bytes vs 145.849.704 bytes
I got similar results on Linux on Power.
I have not yet measured the impact of the resulting matcher on the compile time.

Update on the numbers:
Compiling on Linux on Z (that is, only llvm-tblgen is changed) is now (current/bottom-up) 24m32.520s vs. 25m29.169s

Running LLVM test suite with -fglobal-isel on an AArch64 EC2 instance (2 CPUs, 4GB):
Current implementation: 22m43.224s
Bottom-up implementation: 22m47.165s

Size of the binaries:
Current implementation: llc 127.280.024, clang 217.540.416
Bottom-up implementation: llc 127.026.960, clang 217.291.456

One interesting point to note:
The current combiner implementation loops over the basic blocks of a machine function until a fixpoint is reached. The bottom-up matcher matches all patterns, so in theory a single pass over the instructions should be enough. However, there are two reasons why this is not implemented:

  • The last loop over the instructions removes trivially dead instructions, which would not happen if only a single pass over the instructions were done. This has no functional impact: the next combiner, or at the latest the instruction selector, removes trivially dead instructions. However, a lot of test cases would need to be changed because they do not expect those dead instructions. A simple solution would be to make a second pass over the instructions that just removes the trivially dead instructions, which would be faster than doing a full iteration (a rough sketch follows this list).
  • More importantly, not all patterns would be matched! The reason is that there are C++ matchers which perform a top-down match. One example is the rule fold_global_offset from the AArch64PreLegalizerCombiner. This rule matches only a G_GLOBAL_VALUE, but the C++ matcher looks at the uses of the defined value, which basically means that the matcher reverses the order in which matches are made. As a result, the bottom-up matcher is not aware that it has to apply another rule. I think this case can be fixed by changing the matching pattern. The question is whether this is desirable, because it would add restrictions to what a C++ matcher is allowed to do.
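A rough sketch of the cleanup pass suggested in the first bullet, using the existing GlobalISel helper isTriviallyDead(); this sketch is not part of the patch:

#include "llvm/ADT/STLExtras.h"
#include "llvm/CodeGen/GlobalISel/Utils.h"
#include "llvm/CodeGen/MachineFunction.h"
#include "llvm/CodeGen/MachineRegisterInfo.h"

// Sketch: one extra sweep that only erases trivially dead instructions,
// instead of running a full combiner iteration. Walking backwards lets a
// def become dead in the same sweep once its users have been erased.
static void eraseTriviallyDeadInstrs(llvm::MachineFunction &MF) {
  llvm::MachineRegisterInfo &MRI = MF.getRegInfo();
  for (llvm::MachineBasicBlock &MBB : MF)
    for (llvm::MachineInstr &MI :
         llvm::make_early_inc_range(llvm::reverse(MBB)))
      if (llvm::isTriviallyDead(MI, MRI))
        MI.eraseFromParent();
}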

There are some open todos/ideas to discuss:

  • Measure runtime of the matcher
  • The information required for tests is printed in YAML format. However, the output is not yet the same on all platforms. The current workaround is to re-assign the encoding in a deterministic way. This code is responsible for the larger part of the increased compile time and should be removed.
  • The old matcher generator implementation is still there, but not used.
  • Adding the new parameter to tryCombineAll() should possibly be a separate PR. I think that introducing a new class CombinerState, which is just a thin wrapper around a DenseMap<MachineInstr *, unsigned>, would also make sense (a rough sketch follows this list).
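A minimal sketch of the CombinerState wrapper mentioned in the last bullet; the interface shown here is made up for illustration:

#include "llvm/ADT/DenseMap.h"
#include <optional>

namespace llvm {
class MachineInstr;

// Sketch: a thin wrapper around the per-instruction match set map, so the
// state can outlive a single basic block and be carried through the
// combiner for the whole MachineFunction.
class CombinerState {
  DenseMap<MachineInstr *, unsigned> MatchSets;

public:
  void setMatchSet(MachineInstr *MI, unsigned MS) { MatchSets[MI] = MS; }
  std::optional<unsigned> getMatchSet(MachineInstr *MI) const {
    auto It = MatchSets.find(MI);
    if (It == MatchSets.end())
      return std::nullopt;
    return It->second;
  }
  // Forget an instruction, e.g. when a combine erases it.
  void erase(MachineInstr *MI) { MatchSets.erase(MI); }
};
} // namespace llvm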

Diff Detail

Event Timeline

Kai created this revision.Jan 6 2023, 6:51 AM
Herald added a project: Restricted Project. · View Herald TranscriptJan 6 2023, 6:51 AM
Kai requested review of this revision.Jan 6 2023, 6:51 AM
Kai retitled this revision from [RFC][GlibalISel] Replace the current GlobalISel matcher with a bottom-up matcher to [RFC][GlobalISel] Replace the current GlobalISel matcher with a bottom-up matcher.Jan 6 2023, 6:52 AM
arsenm added a comment.Jan 6 2023, 7:00 AM

Compiling LLVM takes slightly longer with this change applied: 25m35.471s vs 28m51.797s

This is huge. I don't understand this part. Surely you don't have a bootstrapped GlobalISel build working? Why this slowdown?

However, a lot of test cases would need to be changed because they do not expect those dead instructions. A simple solution would be to make a second pass over the instructions that just removes the trivially dead instructions, which would be faster than doing a full iteration.

We can also just update the tests. Avoiding test changes isn't a good reason to do something.

arsenm added a comment.Jan 6 2023, 7:01 AM

Adding the new parameter to tryCombineAll() should possibly be a separate PR. I think that introducing a new class CombinerState, which is just a thin wrapper around a DenseMap<MachineInstr *, unsigned>, would also make sense.

Like D81899?

arsenm added inline comments.Jan 6 2023, 7:08 AM
llvm/lib/CodeGen/GlobalISel/Combiner.cpp
155

can just directly print *MI

164–165

Should move construction of these out of the loop too?

llvm/lib/Target/AArch64/GISel/AArch64O0PreLegalizerCombiner.cpp
75

Should use a typedef/using for the set type

llvm/utils/TableGen/GlobalISel/GIMatchTreeAutomaton.cpp
71

Avoid relative include paths

171

ArrayRefs should be passed by value

354

!empty()

363

Single quotes here and elsewhere for single character printing

673–679

Can this just be a standalone function if you're passing in the two items explicitly anyway?

752

Typo DAAGRoot

809

Don't need llvm::

979

Hoist construction out of loop?

llvm/utils/TableGen/GlobalISel/GIMatchTreeAutomaton.h
30

Should avoid relative include paths

Kai added a comment.Jan 6 2023, 7:59 AM

Compiling LLVM takes slightly longer with this change applied: 25m35.471s vs 28m51.797s

This is huge. I don't understand this part. Surely you don't have a bootstrapped GlobalISel build working? Why this slowdown?

The slowdown comes from llvm-tblgen and is due to the sorting at the end of createMatchSets(). This is currently required to get a stable encoding for testing. I am currently working on changing the test structure, with the goal of removing the sorting completely.
Another reason for the slowdown is that my implementation still calculates too much: I always loop over all representer sets, but only sets added in the last round can actually add another set. This can clearly be improved (a rough sketch of the idea follows).
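The delta-based iteration looks roughly like this (sketch only; MatchSet, initialLeafSets and combineWithKnown are simplified stand-ins for parts of createMatchSets()):

#include <set>
#include <utility>
#include <vector>

using MatchSet = unsigned; // stand-in; the real thing is a set of patterns

// Hypothetical helpers standing in for parts of createMatchSets().
std::vector<MatchSet> initialLeafSets();
std::vector<MatchSet> combineWithKnown(MatchSet MS,
                                       const std::set<MatchSet> &Known);

void buildMatchSets() {
  // Only sets added in the previous round can produce new sets, so keep a
  // frontier instead of re-scanning every representer set each round.
  std::vector<MatchSet> Frontier = initialLeafSets();
  std::set<MatchSet> Known(Frontier.begin(), Frontier.end());
  while (!Frontier.empty()) {
    std::vector<MatchSet> Next;
    for (MatchSet MS : Frontier)
      for (MatchSet Derived : combineWithKnown(MS, Known))
        if (Known.insert(Derived).second)
          Next.push_back(Derived);
    Frontier = std::move(Next);
  }
}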

I have not yet measured bootstrapping with GlobalISel. The only reason is that I am struggling a bit with setting up a cross-compilation environment. I guess AArch64 is the best architecture to test the impact.

Kai added a comment.Jan 6 2023, 8:02 AM

We can also just update tests. Avoiding test changes isn't a good reason to do something

Of course. It just makes the change even bigger.

Kai added a comment.Jan 6 2023, 8:15 AM

Adding the new parameter to tryCombineAll() should possibly be a separate PR. I think that introducing a new class CombinerState, which is just a thin wrapper around a DenseMap<MachineInstr *, unsigned>, would also make sense.

Like D81899?

I think this does not go far enough. To be able to look across basic block boundaries, the state must be available for the whole MachineFunction.
Therefore I store the state in the Combiner and pass it down to CombinerInfo::combine. An alternative would be to add a new member to CombinerInfo. This is just because of the lifetime requirement; the state is only used in tryCombineAll(), but I have not yet found a better solution.

Kai updated this revision to Diff 487095.Jan 7 2023, 8:02 AM

I restructured the YAML output used for the tests. As a result I could remove the sorting code, which caused the longer running time of the algorithm. The construction of the matcher is much faster now. However, testing got more challenging. I also made some minor changes in other parts of the algorithm. The generated matcher code is unchanged.

lkail added a subscriber: lkail.Jan 8 2023, 6:58 PM

Note: I didn't quite review the algorithm/logic itself (I'll come back to it later, when I'm sure I understand it fully, to make meaningful comments), so I focused mostly on code style and general comments.

The old matcher generator implementation is still there, but not used.

I would add a follow-up diff to remove it entirely; otherwise I get the feeling it will stay there for a very long time.

The information required for tests is printed in YAML format. However, the output is not yet the same on all platforms. The current workaround is to re-assign the encoding in a deterministic way. This code is responsible for the larger part of the increased compile time and should be removed.

Could it be a separate diff?

I also share the worries about compilation time; it would be nice to have some more detailed measurements (ideally on both AMDGPU and AArch64) and maybe make some additional optimizations before this gets the green light.
(IMHO) Since this is a pretty big change, I'd like to make sure it doesn't affect compilation times negatively, as a recurring theme in LLVM is slow compilation times and we don't want to take huge steps backwards.

llvm/include/llvm/CodeGen/GlobalISel/GISelWorkList.h
115–118
llvm/lib/CodeGen/GlobalISel/Combiner.cpp
132

The class doesn't look like it repairs anything, so the name could be better. If I understand correctly, it builds a worklist from a list of changed instructions, adding the "parent" instructions as well. Maybe it should just be named TreeWorklistBuilder or something like that? I feel like it could also just be a pair of recursive functions; the class itself adds no value.

Lastly, as this recursively adds all "parent" instructions, I'm afraid it could "explode" in some cases and add almost all instructions. Is that possible? Can you maybe look for optimizations/opportunities to skip some instructions? If there are none/they are rare, please add a comment explaining why we need to aggressively add everything.

134

nit: I would move typedefs to the top and not mix them with variable declarations for readability

155

nit: I would split it in two lines

IMO debug statements like this are only useful with added context, so I would also print something before the loop runs, e.g. "Running TreeRepairer on N changed instructions". It would also be nice after each iteration to print how many instructions were added to the worklist. It could make it easier to spot cases where this "explodes" and greatly increases the number of instructions added to the worklist.

164–165

If we're always passing a new set, can we just not pass it by reference to WorklistMaintainer and instead let it create it & use an accessor to retrieve it later?

llvm/utils/TableGen/GICombinerEmitter.cpp
731–734
742
744–747

It's one statement but more than one physical line so IIRC we want braces here

757–758

small nit: I'm not sure I like for loops without the increment part; could this just be a while loop?

llvm/utils/TableGen/GlobalISel/GIMatchTreeAutomaton.cpp
19
27
34
84

Unnecessary as all your code is inside namespace llvm

86

Are those TODOs to be done before this diff lands, or will they stay?

If they're here to stay, use // instead of /* */: https://llvm.org/docs/CodingStandards.html#comment-formatting

113

A bit scary to read and it would be useful to elaborate. Does it leak? Is it inefficient in terms of space (allocates too much) or time (e.g. chains of owned ptrs need to be freed and it takes time) ?

138

Could this (& its derived class) move to a separate header? (or just GIMatchTreeAutomaton.h)
This file is already pretty big and takes time to understand; having a whole data structure definition in there doesn't help.

158

Why int here but unsigned in other places? (same above)

187

Ditto - newline after } and before next function

203
262
331–332

nit: I would avoid 2 letter variable names like those. Maybe just use OpenParen/CloseParen?

645
660

Typedef this to make it less verbose?

805–807
814

Hmm, this works but feels like a weird use for "for". Perhaps a while loop is better here?

822–824
970
973
1000–1002
1147–1149
llvm/utils/TableGen/GlobalISel/GIMatchTreeAutomaton.h
106
112
128

small nit: Can you add a blank line between the } and the next function definitions to make it easier to read?

187

nit: comment doesn't add anything, it just repeats the variable name. I would either remove it or elaborate.

190

Ditto

191

If the key here is a number that starts at a fixed index and is always incremented by 1, you could just use a vector instead of a map

Kai updated this revision to Diff 487460.Jan 9 2023, 8:59 AM
Kai edited the summary of this revision. (Show Details)
  • Visit of changed instructions in the Combiner is now done iteratively.
  • The array with the rules to execute in the generated source is a bit more compact.

Running LLVM test suite with -fglobal-isel on an AArch64 EC2 instance (2 CPUs, 4GB):
Current implementation: 22m43.224s
Bottom-up implementation: 22m47.165s

Size of the binaries:
Current implementation: llc 127.280.024, clang 217.540.416
Bottom-up implementation: llc 127.026.960, clang 217.291.456

Therefore I store the state in the Combiner, and pass it down to CombinerInfo::combine. An alternative would be to add a new member to CombinerInfo. This is just for the life time requirement, the usage is only in the tryCombineAll(), but I have not yet found a better solution.

I think putting it in CombinerInfo would be better

llvm/include/llvm/CodeGen/GlobalISel/CombinerInfo.h
70

I think this needs to be a using/typedef

llvm/lib/CodeGen/GlobalISel/Combiner.cpp
106
llvm/utils/TableGen/GICombinerEmitter.cpp
59

I'd assume compact would be the default?

llvm/utils/TableGen/GlobalISel/GIMatchTreeAutomaton.cpp
117

Don't need llvm namespace?

337

single quotes

358

Single quotes

366

Single quotes

818

Typo missmatch

llvm/utils/TableGen/GlobalISel/GIMatchTreeAutomaton.h
29

Should still avoid relative include paths

  • Visit of changed instructions in the Combiner is now done iteratively.
  • The array with the rules to execute in the generated source is a bit more compact.

Running LLVM test suite with -fglobal-isel on an AArch64 EC2 instance (2 CPUs, 4GB):
Current implementation: 22m43.224s
Bottom-up implementation: 22m47.165s

Size of the binaries:
Current implementation: llc 127.280.024, clang 217.540.416
Bottom-up implementation: llc 127.026.960, clang 217.291.456

I haven't had time to properly look at the implementation, but a question on this. When you say "running the test suite", are you talking about building the test suite with -fglobal-isel, not running the generated code, right?

Secondly, you mentioned that this is a drop-in replacement for the existing matcher, yet we're seeing some code size differences. Do you know why?

Kai added a comment.Jan 12 2023, 7:12 AM
  • Visit of changed instructions in the Combiner is now done iteratively.
  • The array with the rules to execute in the generated source is a bit more compact.

Running LLVM test suite with -fglobal-isel on an AArch64 EC2 instance (2 CPUs, 4GB):
Current implementation: 22m43.224s
Bottom-up implementation: 22m47.165s

Size of the binaries:
Current implementation: llc 127.280.024, clang 217.540.416
Bottom-up implementation: llc 127.026.960, clang 217.291.456

I haven't had time to properly look at the implementation, but a question on this. When you say "running the test suite", are you talking about building the test suite with -fglobal-isel, not running the generated code, right?

Yes, for the time measurement I referred to building the test suite, as this shows the impact of my code on the compile time. (I also ran the test suite to make sure that there are no regressions.)

Secondly, you mentioned that this is a drop-in replacement for the existing matcher, yet we're seeing some code size differences. Do you know why?

Yes. The matcher my code generates has a simpler structure. It is basically:

  • Calculate the match set number for the current instruction, which is either a constant assignment or a table lookup.
  • Translate the match set number to a list of rules to execute, which is another table lookup.
  • Run the rules

The current matcher has a couple of if and switch statements before the rules are run. I noted that the generated tables are much more compact than the code generated from if and switch. The overall price my implementation pays is a greater construction time in llvm-tblgen.
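As a hedged sketch, the generated code has roughly this shape (computeMatchSet, RulesForMatchSet and runRule are invented names for this illustration; CombinerState is the wrapper discussed above):

#include <vector>

namespace llvm {
class MachineInstr;
class CombinerState;
} // namespace llvm

// Invented helpers standing in for the generated tables and rule bodies.
unsigned computeMatchSet(llvm::MachineInstr &MI, llvm::CombinerState &State);
extern const std::vector<std::vector<unsigned>> RulesForMatchSet;
bool runRule(unsigned RuleID, llvm::MachineInstr &MI);

bool tryCombineAll(llvm::MachineInstr &MI, llvm::CombinerState &State) {
  // 1. Constant assignment or table lookup for the match set number.
  unsigned MS = computeMatchSet(MI, State);
  // 2. Translate the match set number into a list of rules (table lookup).
  // 3. Run those rules until one applies.
  for (unsigned RuleID : RulesForMatchSet[MS])
    if (runRule(RuleID, MI))
      return true;
  return false;
}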

I don't have time for a proper review at the moment, as I'm still dealing with a lot of downstream problems, but I have a few comments.

The basic idea of a bottom-up matcher is to associate with each instruction the set of matching patterns, called the match set. For an instruction without use operands (a leaf), the match set is easily determined. For any other instruction, the match set is retrieved via a table lookup, using the match sets of the use operands as table indices. As a result, all matching patterns can be found in linear time.

Aside from the 'bottom-up' part, that's pretty consistent with where I wanted to end up. I was pretty worried about the 'bottom-up' part, as there are some top-down and middle-out combines that were making better decisions with their visibility of all the uses after applying the rule, but you mentioned that those are still available later on. I definitely don't want to lose that capability.

One thing the original matcher was aiming to do was to merge rules that were very similar but diverged near the root of the match. The intent was that, given multiple rules with the same structure and predicates but different opcodes, it could still match the commonality and only diverge at the point it needs to. I guess one concrete example of that is ((x << y) >> y) -> x & M or sext(x, y), depending on whether the shift right is logical or arithmetic. The matchers used by DAGISel and GISel's instruction selector essentially treat these as two fully independent matchers, as they test for the G_LSHR/G_ASHR opcodes first, but those rules could share common matching code right up to the last check. Even if we don't make use of it at the moment, it would be good if we can make this matcher change without making a future change along those lines impossible.
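For instance, in the pattern notation from the summary, the two rules differ only in the root opcode (illustrative only; the current syntax has no such sharing):

(match (G_SHL $t, $x, $y),
       (G_LSHR $dst, $t, $y))  // rewrites to x & M

(match (G_SHL $t, $x, $y),
       (G_ASHR $dst, $t, $y))  // rewrites to sext(x, y)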

At one point I had a draft algorithm for handling multiple mutually exclusive choices based on their relative improvement of the code, but the way things have gone over the last couple of years, I'm never going to find time to implement it. There are some things worth thinking about from it, though. Most of those mutually exclusive choices arise from our frequent use of hasOneUse(). Upstream uses it a bit, but downstream we use it a lot. This predicate is a proxy for what we really want to be checking, which is "will this value become unused after the combine?". Essentially, it's a test that the combine is going to save on work rather than break even or increase it. For example, fadd + fmul -> fma doesn't save anything if the fmul result is needed for something else. It would be good if it were possible for "will be unused" tests to account for more than one applied combine. For example, maybe a multiply-add and a multiply-sub combine match the same mul, so it has two uses, both of which will go away after the combines. Ideally we'd apply both, rather than blocking them because each rule sees that it doesn't eliminate all the uses by itself.
Side note: an unresolved problem in this area is that DBG_VALUE counts as a use, but hasOneNonDbgUse() can get a bit expensive.
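A sketch of the check described above, i.e. "will this value become unused after the pending combines?" (willBecomeUnused and ConsumedByPendingCombines are hypothetical, not an existing API):

#include "llvm/ADT/DenseSet.h"
#include "llvm/CodeGen/MachineRegisterInfo.h"

// Sketch: a value can be combined away if every non-debug use is consumed
// by some combine that is about to be applied; hasOneUse()-style checks
// approximate this for the single-combine case only.
bool willBecomeUnused(
    llvm::Register Reg, const llvm::MachineRegisterInfo &MRI,
    const llvm::DenseSet<llvm::MachineInstr *> &ConsumedByPendingCombines) {
  for (llvm::MachineInstr &UseMI : MRI.use_nodbg_instructions(Reg))
    if (!ConsumedByPendingCombines.contains(&UseMI))
      return false;
  return true;
}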

Running LLVM test suite with -fglobal-isel on an AArch64 EC2 instance (2 CPUs, 4GB):
Current implementation: 22m43.224s
Bottom-up implementation: 22m47.165s

I see Amara asked for clarification on this and it's just compile time. Was runtime similar too?

More importantly, not all patterns would be matched! The reason is that there are C++ matchers which perform a top-down match. One example is the rule fold_global_offset from the AArch64PreLegalizerCombiner. This rule matches only a G_GLOBAL_VALUE, but the C++ matcher looks at the uses of the defined value, which basically means that the matcher reverses the order in which matches are made. As a result, the bottom-up matcher is not aware that it has to apply another rule. I think this case can be fixed by changing the matching pattern. The question is whether this is desirable, because it would add restrictions to what a C++ matcher is allowed to do.

I'm certain there are more cases like that. FWIW, we try not to do that in our target, but there are occasions where doing it allows better decisions.

The information required for tests is printed in YAML format. However, the output is not yet the same on all platforms. The current workaround is to re-assign the encoding in a deterministic way. This code is responsible for the larger part of the increased compile time and should be removed.

It's not important, but I'm a little sad to lose the graphviz output. I found being able to render the matcher in dot pretty useful for early debugging.

The old matcher generator implementation is still there, but not used.

If the code is no longer reachable, then we should remove it. That said, I would expect this change to have a fairly big effect on our tests and perf, so we might be glad of a temporary escape route when it reaches our downstream repo, to avoid blocking further intake of upstream commits until we've sorted out the consequences for our target. As a result, I'd say the old one should be kept for the moment but be deleted a couple of weeks after this.

  • Visit of changed instructions in the Combiner is now done iteratively.
  • The array with the rules to execute in the generated source is a bit more compact.

Running LLVM test suite with -fglobal-isel on an AArch64 EC2 instance (2 CPUs, 4GB):
Current implementation: 22m43.224s
Bottom-up implementation: 22m47.165s

Size of the binaries:
Current implementation: llc 127.280.024, clang 217.540.416
Bottom-up implementation: llc 127.026.960, clang 217.291.456

I haven't had time to properly look at the implementation, but a question on this. When you say "running the test suite", are you talking about building the test suite with -fglobal-isel, not running the generated code, right?

Yes, for the time measurement I referred to building the test suite, as this shows the impact of my code on the compile time. (I also ran the test suite to make sure that there are no regressions.)

Secondly, you mentioned that this is a drop-in replacement for the existing matcher, yet we're seeing some code size differences. Do you know why?

Yes. The matcher my code generates has a simpler structure. It is basically:

  • Calculate the match set number for the current instruction, which is either a constant assignment or a table lookup.
  • Translate the match set number to a list of rules to execute, which is another table lookup.
  • Run the rules

The current matcher has a couple of if and switch statements before the rules are run. I noted that the generated tables are much more compact than the code generated from if and switch. The overall price my implementation pays is a greater construction time in llvm-tblgen.

Oh, I thought the llc/clang binaries were different because of the codegen being different. If it's because the implementation of the combiner itself is different, then it's fine.

Matt added a subscriber: Matt.Jan 17 2023, 11:02 AM

From my reading of this review, it seems no one has a fundamental disagreement with the overall direction. In that case, I think it's ready to go into a conventional implementation review.

llvm/utils/TableGen/GlobalISel/GIMatchTreeAutomaton.cpp
89

dimension

113

Yes, if this leaks (even harmlessly in the case of llvm-tblgen) it could break bots running valgrind etc.

862

Please document more what you mean by Chase and HOD sets and why this mapping is needed. It wasn't obvious to me that HOD here meant "Hoffmann and O'Donnell".

869–870

std::fill()?

Hi,
Just checking in - is this something that you're still interested in landing?

I was excited to see the GISel TableGen backend getting some love, especially since it enables better MIR pattern matching (which I'm interested in), so I was looking forward to this eventually landing.
Of course, no pressure; after all, GISel isn't the default, so this isn't a priority - I was just wondering what the state of this diff was :)
Thanks

Pierre-vh requested changes to this revision.Jul 19 2023, 6:40 AM

(Marking as request changes to remove this from review queues)
See https://discourse.llvm.org/t/rfc-matchtable-based-globalisel-combiners/71457/13, which is another implementation that unfortunately makes this obsolete.

This revision now requires changes to proceed.Jul 19 2023, 6:40 AM