The late machine instruction combiner may replace an instruction
sequence by combined instruction(s) when it is beneficial to do so. It provides
the infrastructure to evaluate instruction combining patterns like mul+add->madd
based on machine trace information. Currently the DAG Combiner greedily
generates combined instructions, which usually is a win for code size, but
unfortunately can cause performance losses. To remedy this the new pass changes
the logic from always generate the combiner instruction(s) to only do so when it
is beneficial.
Details
- Reviewers
Gerolf
Diff Detail
Event Timeline
Hi,
I have no plans for x86, but the design is supposed to make adding other targets easy. Please review the code from that angle and let me know of any issue. I will follow up on these aspects as part of this review.
To support a new target you have to add header with pattern enums, provide the implementation of the combiner interface and disable combinations in the DAG combiner.
Cheers
Gerolf
Hi Gerolf,
This looks like a widely applicable change, and great for OOO cores but I have no data for this. Do you happen to have some (how frequent does the optimization fire in lnt, how much does it shorten the critical path, etc)?
Some comments about the design:
Ideally the instruction selection algorithm would only need the already existing DAG patterns and the MachineTraceMetrics to be able to make these decisions.
The target specific code for your use case seems to be inferable from the existing selection patterns.
Would it be possible to derive the new selection patterns from the already existing tablegen patterns? I suspect the answer would be yes. for the simpler cases, while the more complex ones would require custom handling.
Even if the new patterns are not derivable from the existing ones, I think there is an opportunity to generate the new selection code through tablegen. This might require some changes to the infrastructure, but in principle this would reduce the amount of code required to add support for a new instruction. With the tablegen changes, the size of the AArch64 code would be 2 lines in a .td file. I think your use case is not unique, so doing this would reduce the amount required to do further changes.
Hope this helps!
Cheers,
Silviu
lib/Target/AArch64/AArch64InstrFormats.td | ||
---|---|---|
1356 | Doing this will disable MADD/MSUB generation for in-order cores (for example Cortex-A53). Maybe guard this with a predicate? |
FWIW Silviu's comments precisely mirror my own that I just hadn't had a chance to work up yet. :)
-eric
We could have a target specific option like always combine. In principle that should have the same effect.
Gerolf
I certainly agree with everyone else, it would be really nice to generate these replacements using TableGen. I can think of two possible ways of doing this:
- For all existing patterns of non-trivial complexity, look at how each of the individual pieces would match. Collect those instructions in a corresponding tree and match against that tree to find the instruction with a higher-complexity pattern.
- Specify separate special "output->output" TableGen patterns, and generate the mapping from those.
If we really believe that this is the "right" way to match instructions with complex pattern inputs, then we should probably try for (1) and use this to completely replace the existing complex-pattern-matching infrastructure (for everything except immediates or other "free" operands).
include/llvm/Target/TargetInstrInfo.h | ||
---|---|---|
570 | All potential pattern a listed -> All potential patterns are returned | |
572 | Why is this restricted to binary instructions? | |
582 | make the call -> decide | |
584 | (Likewise, why binary?) | |
588 | old instruction including Root that could -> old instructions, including Root, that could | |
lib/CodeGen/MachineCombiner.cpp | ||
121 | This seems like an unnecessary restriction. Why don't you use a SmallVector? | |
144 | If you use a SmallVector, you can replace the 16 here with InstrDepth.size(). | |
239 | You don't need to repeat this comment. | |
241 | This loop is the same as the previous one, please make this a function. | |
285 | What "original code" are you referring to? Do you mean the code in DAGCombine? | |
285 | replace -> replaced | |
301 | This ordering should be mentioned in the header where genAlternativeCodeSequence is declared. | |
lib/CodeGen/MachineTraceMetrics.cpp | ||
1250 | Exactly what are you proposing? | |
lib/Target/AArch64/AArch64InstrInfo.cpp | ||
686 | Is this a separable (or unrelated) change? |
Changes:
- Added bool alwaysCombine() (Target/TargetInstrInfo.h) so targets
can decide to always replace a given pattern. This should be equivalent to
the current code in DAGCombine when a given pattern is disabled.
- InstrDepth is now a small vector (MachineCombiner.cpp)
- Added helper function instr2instrSC (MachineCombiner.cpp)
- Improved comments as suggested by reviewers
Just some minor issues that I've noticed (comments inlined). Otherwise looks good.
Cheers,
Silviu
include/llvm/Target/TargetInstrInfo.h | ||
---|---|---|
609 | The insertMove method isn't used anywhere. Could it be removed? | |
lib/CodeGen/MachineTraceMetrics.cpp | ||
1228 | This duplicates (almost) the code above. I think these should be merged. | |
1247 | This FIXME comment now seems confusing since you already added the fix. | |
test/CodeGen/AArch64/aarch64-neon-mul-div.ll | ||
82 ↗ | (On Diff #11483) | How are the div tests related to this change? |
Minor changes:
a) Cleaned up tests in test/CodeGen/AArch64: arm64-neon-mul-div.ll,
dp-3source.ll and mul-lohi.ll. The only change to a critical path
length is in the single block of mul-lohi.ll: generating mul-add
instead of madd shortens cpl from 16 to 8 cycles. Removed previous
tests since they were too elaborate for this commit.
b) Added capture extraCycles() in MachineTraceMetrics.cpp
c) Removed insertMove() in TargetInstInfo.h
d) Comments/Format changes
Hi Gerolf,
A few comments inline and a general question:
It seems like the matching infrastructure is very verbose on the backend level. What alternatives do we have to make this smaller? Perhaps something ala the existing pattern match infrastructure we have for IR? Something else? It looks even more painful than DAGCombines at the moment to add things :)
Thanks!
-eric
lib/CodeGen/MachineCombiner.cpp | ||
---|---|---|
114 | Seems to be a very long function... could you break it up into computing the path and then making the determination? | |
216 | Can use a typedef or a SmallVectorImpl instead of writing the number all over the place. Should help with formatting. | |
283 | "beneficial" | |
320 | This conditional is a little hard to read - some way to break it up/hoist things out? | |
326 | Extra space. | |
lib/Target/AArch64/AArch64InstrInfo.cpp | ||
2103 | The name here is a bit limiting. What if you want to combine something else in the future? Same with the rest of these helpers. | |
2172 | Unnecessary cast? | |
2191 | "All potential patterns are..." | |
2195 | This code looks like it could be written ala include/llvm/IR/PatternMatch.h? | |
2304 | Documentation for these functions describing the incoming variables, constraints, etc. | |
lib/Target/AArch64/AArch64InstrInfo.h | ||
158 | "All potential patterns are..." |
- Minor cleanups + added comments.
- New functions getDepth() and getLatency() in MachineCombiner.cpp
to simplify preservesCriticalPathLen().
- New function doSubstitute() to make a conditional in combineInstructions()
more readable
Thank you for the good suggestions! The target-specific part is not perfect yet and needs to evolve in stages. Please also see below.
-Gerolf
Ah, I just noticed that a header file is missing (AArch64MachineCombinerPattern.h).
Otherwise I have no objections to an initial commit, provided that everyone is happy with this as well.
Cheers,
Silviu
lib/Target/AArch64/AArch64InstrInfo.cpp | ||
---|---|---|
17 | This header file is missing from the review. |
Uups, thanks for catching this! Just added the header file to phabricator.
Cheers
Gerolf
Looks like I spoke too soon (new comment inlined). I think there are some issues with targets without a machine model.
- Silviu
lib/CodeGen/MachineCombiner.cpp | ||
---|---|---|
419 | Would it be better to always combine if there is no machine model (for example in case the cpu is not specified)? If we bail out here, we would no longer generate the patterns that are handled by this pass. |
Hi Silviu,
I think that your suggestion is great! Every target should have the opportunity to invoke the machine combiner.
Thanks,
Gerolf
PS:
if (!TSchedModel.hasInstrSchedModel() && !TII->alwaysCombine()) {
DEBUG(dbgs() << " Skipping pass: no machine model available\n"); return false;
}
Update: Allow any target to invoke the machine combiner,
even when it does not support a instruction scheduling
model.
Hi Gerolf,
As is, the madd/msub instructions still won't get generated unless the target hook gets implemented.
I don't see why any target would choose not to generate these instructions when no machine model is
present, and in this case we can have something more general here.
I was thinking more along the lines of changing doSubstitute to check if there is a machine model and
if there isn't return true. The tricky part here would be making sure that everything actually works ok
(the pass doesn't require a machine model when there isn't one).
Also, I think a regression test should be added for this (making sure that at madd/msub instructions are
still being generated).
Thanks,
Silviu
Hi Silviu,
thanks for following up on this. I think we are on the same page wrt to the overall goal, but still have a different understanding of the code. Even with my changes madd/msub instructions are generated by the DAGCombiner *unless* a target *chooses* not to do so and instead invokes the MachineCombiner, which decides on combining based on a cost function. For AArch64 the code that disables the pattern in DAGCombiner is in lib/Target/AArch64/AArch64InstrInfo.td. There is no change to other targets. With the alwaysCombine() routine the target has the means to decide to always combine in the MachineCombiner. The code now ensures this can be done independent of the existence of a machine model.
Cheers
Gerolf
Thanks, this all makes sense to me. However, I think there is still an issue with the default
AArch64 target.
I'll first give a summary:
I've ran the mul-lohi.ll test without a -mcpu flag (with and without this change) and with this change
we don't get the madd instructions, as would have been the case otherwise. What seems to happen
is that the pattern won't be selected during instruction selection because the patterns have been disabled
and the MachineCombiner pass bails out because there is no machine model.
So with the current change to AArch64, using the defaults we will not get madd/msub instructions
because nothing can generate them.
This seems to be an unintended effect and I don't think it's an improvement.
I agree that it is possible for someone to go and do the work to get the behaviour for the generic
AArch64 target back. Is there any reason why we wouldn't make sure that this is change is a nop
in this case now? It seems like the right thing to do.
Silviu
Support for madd/msub generation when no machine model is available
(but the target supports MachineCombiner).
Added test case test/CodeGen/madd-lohi.ll
Thanks Silviu + Eric + Hal. Very much appreciate your feedback + reviews!
Cheers
Gerolf
All potential pattern a listed -> All potential patterns are returned