Page MenuHomePhabricator

Review for machine combiner pass
Needs ReviewPublic

Authored by Gerolf on Jul 2 2014, 10:26 PM.



The late machine instruction combiner may replace an instruction
sequence by combined instruction(s) when it is beneficial to do so. It provides
the infrastructure to evaluate instruction combining patterns like mul+add->madd
based on machine trace information. Currently the DAG Combiner greedily
generates combined instructions, which usually is a win for code size, but
unfortunately can cause performance losses. To remedy this the new pass changes
the logic from always generate the combiner instruction(s) to only do so when it
is beneficial.

Diff Detail

Event Timeline

Gerolf updated this revision to Diff 11041.Jul 2 2014, 10:26 PM
Gerolf retitled this revision from to Review for machine combiner pass.
Gerolf updated this object.
Gerolf edited the test plan for this revision. (Show Details)
Gerolf added a reviewer: Gerolf.
Gerolf added a subscriber: Gerolf.
OlegM added a subscriber: OlegM.Jul 3 2014, 2:47 AM
OlegM added a comment.Jul 3 2014, 6:29 AM

Do you have plans to enable this optimization as well for x86 FMA instructions?

Gerolf edited edge metadata.Jul 3 2014, 11:32 AM


I have no plans for x86, but the design is supposed to make adding other targets easy. Please review the code from that angle and let me know of any issue. I will follow up on these aspects as part of this review.

To support a new target you have to add header with pattern enums, provide the implementation of the combiner interface and disable combinations in the DAG combiner.


Hi Gerolf,

This looks like a widely applicable change, and great for OOO cores but I have no data for this. Do you happen to have some (how frequent does the optimization fire in lnt, how much does it shorten the critical path, etc)?

Some comments about the design:

Ideally the instruction selection algorithm would only need the already existing DAG patterns and the MachineTraceMetrics to be able to make these decisions.

The target specific code for your use case seems to be inferable from the existing selection patterns.
Would it be possible to derive the new selection patterns from the already existing tablegen patterns? I suspect the answer would be yes. for the simpler cases, while the more complex ones would require custom handling.

Even if the new patterns are not derivable from the existing ones, I think there is an opportunity to generate the new selection code through tablegen. This might require some changes to the infrastructure, but in principle this would reduce the amount of code required to add support for a new instruction. With the tablegen changes, the size of the AArch64 code would be 2 lines in a .td file. I think your use case is not unique, so doing this would reduce the amount required to do further changes.

Hope this helps!


silviu.baranga added inline comments.Jul 6 2014, 5:57 AM

Doing this will disable MADD/MSUB generation for in-order cores (for example Cortex-A53). Maybe guard this with a predicate?

echristo edited edge metadata.Jul 6 2014, 10:16 AM

FWIW Silviu's comments precisely mirror my own that I just hadn't had a chance to work up yet. :)


Gerolf added a comment.Jul 6 2014, 8:28 PM

We could have a target specific option like always combine. In principle that should have the same effect.


mcrosier removed a subscriber: mcrosier.Jul 7 2014, 7:24 PM
mcrosier added a subscriber: mcrosier.
hfinkel added a subscriber: hfinkel.Jul 8 2014, 3:41 PM

I certainly agree with everyone else, it would be really nice to generate these replacements using TableGen. I can think of two possible ways of doing this:

  1. For all existing patterns of non-trivial complexity, look at how each of the individual pieces would match. Collect those instructions in a corresponding tree and match against that tree to find the instruction with a higher-complexity pattern.
  2. Specify separate special "output->output" TableGen patterns, and generate the mapping from those.

If we really believe that this is the "right" way to match instructions with complex pattern inputs, then we should probably try for (1) and use this to completely replace the existing complex-pattern-matching infrastructure (for everything except immediates or other "free" operands).


All potential pattern a listed -> All potential patterns are returned


Why is this restricted to binary instructions?


make the call -> decide


(Likewise, why binary?)


old instruction including Root that could -> old instructions, including Root, that could


This seems like an unnecessary restriction. Why don't you use a SmallVector?


If you use a SmallVector, you can replace the 16 here with InstrDepth.size().


You don't need to repeat this comment.


This loop is the same as the previous one, please make this a function.


What "original code" are you referring to? Do you mean the code in DAGCombine?


replace -> replaced


This ordering should be mentioned in the header where genAlternativeCodeSequence is declared.


Exactly what are you proposing?


Is this a separable (or unrelated) change?

Gerolf updated this revision to Diff 11483.Jul 15 2014, 6:26 PM
Gerolf edited edge metadata.


  1. Added bool alwaysCombine() (Target/TargetInstrInfo.h) so targets

can decide to always replace a given pattern. This should be equivalent to
the current code in DAGCombine when a given pattern is disabled.

  1. InstrDepth is now a small vector (MachineCombiner.cpp)
  2. Added helper function instr2instrSC (MachineCombiner.cpp)
  3. Improved comments as suggested by reviewers

Just some minor issues that I've noticed (comments inlined). Otherwise looks good.



The insertMove method isn't used anywhere. Could it be removed?


This duplicates (almost) the code above. I think these should be merged.


This FIXME comment now seems confusing since you already added the fix.

82 ↗(On Diff #11483)

How are the div tests related to this change?

Gerolf updated this revision to Diff 11729.Jul 21 2014, 4:24 PM

Minor changes:
a) Cleaned up tests in test/CodeGen/AArch64: arm64-neon-mul-div.ll,
dp-3source.ll and mul-lohi.ll. The only change to a critical path
length is in the single block of mul-lohi.ll: generating mul-add
instead of madd shortens cpl from 16 to 8 cycles. Removed previous
tests since they were too elaborate for this commit.
b) Added capture extraCycles() in MachineTraceMetrics.cpp
c) Removed insertMove() in TargetInstInfo.h
d) Comments/Format changes

Please see below. Changes are on phabricator

Hi Gerolf,

A few comments inline and a general question:

It seems like the matching infrastructure is very verbose on the backend level. What alternatives do we have to make this smaller? Perhaps something ala the existing pattern match infrastructure we have for IR? Something else? It looks even more painful than DAGCombines at the moment to add things :)




Seems to be a very long function... could you break it up into computing the path and then making the determination?


Can use a typedef or a SmallVectorImpl instead of writing the number all over the place. Should help with formatting.




This conditional is a little hard to read - some way to break it up/hoist things out?


Extra space.


The name here is a bit limiting. What if you want to combine something else in the future? Same with the rest of these helpers.


Unnecessary cast?


"All potential patterns are..."


This code looks like it could be written ala include/llvm/IR/PatternMatch.h?


Documentation for these functions describing the incoming variables, constraints, etc.


"All potential patterns are..."

Gerolf updated this revision to Diff 11801.Jul 22 2014, 9:23 PM
  • Minor cleanups + added comments.
  • New functions getDepth() and getLatency() in MachineCombiner.cpp

to simplify preservesCriticalPathLen().

  • New function doSubstitute() to make a conditional in combineInstructions()

more readable

Thank you for the good suggestions! The target-specific part is not perfect yet and needs to evolve in stages. Please also see below.


Ah, I just noticed that a header file is missing (AArch64MachineCombinerPattern.h).

Otherwise I have no objections to an initial commit, provided that everyone is happy with this as well.



This header file is missing from the review.

Gerolf updated this revision to Diff 11920.Jul 26 2014, 7:05 PM

Add missing header file: lib/Target/AArch64/AArch64MachineCombinerPattern.h

Uups, thanks for catching this! Just added the header file to phabricator.


Looks like I spoke too soon (new comment inlined). I think there are some issues with targets without a machine model.

  • Silviu
silviu.baranga added inline comments.Jul 27 2014, 6:22 PM

Would it be better to always combine if there is no machine model (for example in case the cpu is not specified)? If we bail out here, we would no longer generate the patterns that are handled by this pass.

Hi Silviu,

I think that your suggestion is great! Every target should have the opportunity to invoke the machine combiner.



if (!TSchedModel.hasInstrSchedModel() && !TII->alwaysCombine()) {

DEBUG(dbgs() << "  Skipping pass: no machine model available\n");
return false;


Gerolf updated this revision to Diff 11961.Jul 28 2014, 3:33 PM

Update: Allow any target to invoke the machine combiner,
even when it does not support a instruction scheduling

Hi Gerolf,

As is, the madd/msub instructions still won't get generated unless the target hook gets implemented.
I don't see why any target would choose not to generate these instructions when no machine model is
present, and in this case we can have something more general here.

I was thinking more along the lines of changing doSubstitute to check if there is a machine model and
if there isn't return true. The tricky part here would be making sure that everything actually works ok
(the pass doesn't require a machine model when there isn't one).

Also, I think a regression test should be added for this (making sure that at madd/msub instructions are
still being generated).


Hi Silviu,

thanks for following up on this. I think we are on the same page wrt to the overall goal, but still have a different understanding of the code. Even with my changes madd/msub instructions are generated by the DAGCombiner *unless* a target *chooses* not to do so and instead invokes the MachineCombiner, which decides on combining based on a cost function. For AArch64 the code that disables the pattern in DAGCombiner is in lib/Target/AArch64/ There is no change to other targets. With the alwaysCombine() routine the target has the means to decide to always combine in the MachineCombiner. The code now ensures this can be done independent of the existence of a machine model.


Thanks, this all makes sense to me. However, I think there is still an issue with the default
AArch64 target.

I'll first give a summary:

I've ran the mul-lohi.ll test without a -mcpu flag (with and without this change) and with this change
we don't get the madd instructions, as would have been the case otherwise. What seems to happen
is that the pattern won't be selected during instruction selection because the patterns have been disabled
and the MachineCombiner pass bails out because there is no machine model.

So with the current change to AArch64, using the defaults we will not get madd/msub instructions
because nothing can generate them.

This seems to be an unintended effect and I don't think it's an improvement.

I agree that it is possible for someone to go and do the work to get the behaviour for the generic
AArch64 target back. Is there any reason why we wouldn't make sure that this is change is a nop
in this case now? It seems like the right thing to do.


Gerolf updated this revision to Diff 12139.Aug 1 2014, 7:54 PM

Support for madd/msub generation when no machine model is available
(but the target supports MachineCombiner).

Added test case test/CodeGen/madd-lohi.ll

Thanks! LGTM


Gerolf added a comment.Aug 3 2014, 3:21 PM

Thanks Silviu + Eric + Hal. Very much appreciate your feedback + reviews!


echristo resigned from this revision.May 26 2016, 12:43 PM
echristo removed a reviewer: echristo.