The GICombiner can match instruction patterns, but in the current implementation this feature is barely
used. One reason is that the matching capability is rather limited. Let's look at some examples.
I can easily match a G_ICMP/G_ZEXT/G_SUB sequence:
(match (G_ICMP $icmp, $cc, $srcb, $srcc), (G_ZEXT $zext, $icmp), (G_SUB $dst, $srca, $zext))
Just change G_SUB to G_ADD:
(match (G_ICMP $icmp, $cc, $srcb, $srcc), (G_ZEXT $zext, $icmp), (G_ADD $dst, $srca, $zext))
and it does not work as expected, because the current implementation does not handle commutable instructions the way the C++ mi_match() does. To get the desired outcome, you have to add another combine rule, identical except that the $srca and $zext operands are swapped. At that point, just using C++ code seems easier.
Even more annoying: a simple tree pattern like
(match (G_AND $srca, $t1, $t2), (G_AND $srcb, $t3, $t4), (G_OR $dst, $srca, $srcb))
just crashes TableGen.
The proposal is to replace the current matcher implementation with a bottom-up matcher.
The basic idea of a bottom-up matcher is to associate with each instruction the set of matching patterns, called the match set. For an instruction without use operands (a leaf), the match set is easily determined. For any other instruction, the match set is retrieved via a table lookup, using the match sets of the use operands as table indices. As a result, all matching patterns can be found in linear time.
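To illustrate the idea, here is a minimal C++ sketch under simplifying assumptions: match sets are encoded as small integer states, the generated tables are stood in for by a std::map, and instructions are visited in def-before-use order so each one needs a single lookup. Instr, computeMatchSets, and the state encoding are hypothetical names for this sketch, not the actual GICombiner API.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical instruction representation: an opcode and the indices of the
// instructions defining its use operands (empty for a leaf).
struct Instr {
  std::string Opcode;
  std::vector<int> Uses;
};

// Key into the generated lookup table: opcode plus the match-set states of
// the use operands. State 0 is the "matches nothing" default.
using Key = std::pair<std::string, std::vector<int>>;

// One pass in def-before-use order: the states of the use operands are
// already known when an instruction is visited, so each instruction needs a
// single table lookup. Returns the state of the last instruction.
int computeMatchSets(const std::vector<Instr> &Prog,
                     const std::map<Key, int> &Table,
                     std::vector<int> &State) {
  State.assign(Prog.size(), 0);
  for (size_t I = 0; I < Prog.size(); ++I) {
    std::vector<int> UseStates;
    for (int U : Prog[I].Uses)
      UseStates.push_back(State[U]);
    auto It = Table.find({Prog[I].Opcode, UseStates});
    State[I] = It == Table.end() ? 0 : It->second;
  }
  return State.empty() ? 0 : State.back();
}
```

Encoding the G_ICMP/G_ZEXT/G_SUB pattern above as table entries (G_ICMP over leaves -> state 1, G_ZEXT of state 1 -> state 2, G_SUB of a leaf and state 2 -> state 3), the G_SUB instruction ends up in the state representing the full match without any backtracking.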
Commutable instructions are handled by adding the variant patterns that result from swapping the subpatterns. For the generated code, the implication is that only the variable bindings vary; the rule code is not duplicated.
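A sketch of how such a variant encoding could look, with hypothetical names (Variant, bind): both operand orders of a commutable G_ADD map to the same rule id, and only the permutation used to bind $srca and $zext differs, so the apply code exists exactly once.

```cpp
#include <array>
#include <cassert>
#include <string>
#include <utility>

// Hypothetical encoding: a matched state carries a rule id plus an operand
// permutation. Commutable variants share the rule id and differ only in the
// permutation, so the rule code is not duplicated.
struct Variant {
  int RuleId;              // which apply routine to run
  std::array<int, 2> Perm; // maps ($srca, $zext) to use operand indices
};

// Bind the pattern variables for a matched instruction with use operands
// Ops, according to the variant's permutation. Returns {$srca, $zext}.
inline std::pair<std::string, std::string>
bind(const Variant &V, const std::array<std::string, 2> &Ops) {
  return {Ops[V.Perm[0]], Ops[V.Perm[1]]};
}
```

With operands {"%x", "%zext"}, the variant for (G_ADD $dst, $srca, $zext) binds $srca to %x, while the swapped variant binds $srca to %zext; both dispatch to the same rule id.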
This implementation is based on the algorithm by Chase; see https://dl.acm.org/doi/10.1145/41625.41640.
It is a drop-in replacement for the existing matcher, with no changes in the result.
Some numbers from Linux on Z (orig vs change applied):
Compiling LLVM takes longer with this change applied: 25m35.471s vs 28m51.797s
The resulting llc binary is slightly smaller: 145.899.472 bytes vs 145.849.704 bytes
I got similar results on Linux on Power.
I have not yet measured the impact of the resulting matcher on the compile time.
Update on the numbers:
Compiling LLVM on Linux on Z (that is, with only llvm-tblgen changed) is now (current vs. bottom-up) 24m32.520s vs. 25m29.169s
Running LLVM test suite with -fglobal-isel on an AArch64 EC2 instance (2 CPUs, 4GB):
Current implementation: 22m43.224s
Bottom-up implementation: 22m47.165s
Size of the binaries:
Current implementation: llc 127.280.024, clang 217.540.416
Bottom-up implementation: llc 127.026.960, clang 217.291.456
One interesting point to note:
The current combiner implementation loops over the basic blocks of a machine function until a fixpoint is reached. The bottom-up matcher matches all patterns, so in theory a single loop over the instructions should be enough. However, there are two reasons why this is not implemented:
- The last loop over the instructions removes trivially dead instructions, which would not happen if only a single loop over the instructions were done. This has no functional impact: the next combiner, or at the latest the instruction selector, removes trivially dead instructions. However, a lot of test cases would need to be changed because they do not expect those dead instructions. A simple solution would be a second loop over the instructions that only removes trivially dead instructions, which would be faster than a full iteration.
- More importantly, not all patterns would be matched! The reason is that there are C++ matchers which perform a top-down match. One example is the rule fold_global_offset from the AArch64PreLegalizerCombiner. This rule matches a G_GLOBAL_VALUE only, but the C++ matcher looks at the uses of the defined value, which basically means that the matcher reverses the order in which matches are made. As a result, the bottom-up matcher is not aware that it has to apply another rule. I think this case can be fixed by changing the matching pattern. The question is whether this is desirable, because it would add restrictions on what a C++ matcher is allowed to do.
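The fixpoint driver discussed above can be pictured with a minimal, self-contained sketch: a toy instruction list of integers, where TryCombine stands in for running all combine rules on one instruction and reports whether it changed anything. This only mirrors the driver's shape, not the actual Combiner code.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Fixpoint driver, as in the current combiner: keep iterating over all
// instructions until one full pass makes no change. Returns the number of
// passes, including the final pass that changes nothing.
int runToFixpoint(std::vector<int> &Insts,
                  const std::function<bool(std::vector<int> &, size_t)>
                      &TryCombine) {
  int Passes = 0;
  bool Changed = true;
  while (Changed) {
    Changed = false;
    ++Passes;
    for (size_t I = 0; I < Insts.size(); ++I)
      Changed |= TryCombine(Insts, I);
  }
  return Passes;
}
```

With a toy rule that decrements any value greater than 1, the driver needs multiple passes before it settles, which is exactly the cost a single bottom-up pass would avoid.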
There are some open todos/ideas to discuss:
- Measure runtime of the matcher
- The information required for tests is printed in YAML format. However, the output is not yet the same on all platforms. The current solution is to re-assign the encoding in a deterministic way. This code is responsible for the larger part of the increased compile time and should be removed.
- The old matcher generator implementation is still there, but not used.
- Adding the new parameter to tryCombineAll() should possibly be a separate PR. I think that introducing a new class CombinerState, which is just a thin wrapper around a DenseMap<MachineInstr *, unsigned>, would also make sense.
I think this needs to be a using/typedef.
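A sketch of what the proposed CombinerState could look like, with std::unordered_map standing in for llvm::DenseMap<MachineInstr *, unsigned> and void * standing in for MachineInstr * so the sketch is self-contained. Per the note above, a plain using-declaration may already be enough; the thin wrapper is the alternative if member functions turn out to be useful.

```cpp
#include <cassert>
#include <unordered_map>

// Simplest form: just an alias (a using/typedef, as suggested above).
// std::unordered_map stands in for llvm::DenseMap here.
using CombinerState = std::unordered_map<void *, unsigned>;

// Thin wrapper variant, hypothetical interface: lookup() returns 0 for
// instructions that have no recorded state yet.
class CombinerStateWrapper {
  std::unordered_map<void *, unsigned> States;

public:
  unsigned lookup(void *MI) const {
    auto It = States.find(MI);
    return It == States.end() ? 0u : It->second;
  }
  void set(void *MI, unsigned S) { States[MI] = S; }
};
```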