This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Combine multiple FDIVs with the same divisor
ClosedPublic

Authored by HaoLiu on Nov 19 2014, 11:04 PM.

Details

Summary

Hi Tim and other reviewers,

This patch try to combine multiple FDIVs with the same divisor to one FDIV and multiple FMULs. This can have benefit on performance because a FMUL is much faster than a FDIV.
E.g. we combine:

a / D; 
b / D; 
c / D;

To

recip = 1.0 / D;
a * recip;
b * recip;
c * recip;

This is not always benefit, as we can see that the critical path increases from one FDIV to one FDIV and one FMUL, which may cause regressions. So this patch will only do such combine when there are more than 2 FDIVs.

This patch can only benefit some special benchmarks.
Our performance test on Cortex-A57 shows only SPEC2006 benchmark 188.ammp has 2.5%-3.0% improvement.

Review please.

Thanks,
-Hao

Diff Detail

Event Timeline

HaoLiu retitled this revision from to [AArch64] Combine multiple FDIVs with the same divisor.
HaoLiu updated this object.
HaoLiu edited the test plan for this revision. (Show Details)
HaoLiu added a reviewer: t.p.northover.
HaoLiu added a subscriber: Unknown Object (MLST).

For reference, Ben proposed a target-independent solution here:
http://llvm.org/bugs/show_bug.cgi?id=16218

Unfortunately, it never made it into mainline because there were some regressions that needed inspection. The obvious advantage of the target-independent approach is that it works for all targets and across basic blocks. However, performing this transform across basic blocks may not matter in practice, as Duncan pointed out. Also, it may prevent other IR transforms from firing.

Can you please put this in DAGCombine, and let the target optionally enable it. The target can customize this:

// Skip if there is less than three FDIVs.
// FIXME: Different subtargets may behave differently. This can be
// controlled depending on subtargets.
if (Users.size() < 3)

I think that adding something like:

virtual bool combineRepeatedFPDivisors(unsigned &MinUsers) {
  return false;
}

added in TargetLowering.h right around the existing division functions would work well (I'm not attached to the name, feel free to propose some other name). I'd like to use this for PPC too.

spatel added a subscriber: spatel.Nov 20 2014, 8:02 AM

Just chiming in for the target-independent path: x86 needs this too.

I don't understand the expense argument cited here:
http://llvm.org/bugs/show_bug.cgi?id=16218#c4

Is checking uses in InstCombine more expensive than in DAGCombine? This transform is only firing with fast FP-math. Does that make it any more tolerable?

HaoLiu updated this revision to Diff 16468.Nov 20 2014, 7:28 PM
In D6334#5, @mcrosier wrote:

For reference, Ben proposed a target-independent solution here:
http://llvm.org/bugs/show_bug.cgi?id=16218

Unfortunately, it never made it into mainline because there were some regressions that needed inspection. The obvious advantage of the target-independent approach is that it works for all targets and across basic blocks. However, performing this transform across basic blocks may not matter in practice, as Duncan pointed out. Also, it may prevent other IR transforms from firing.

Hi Chad,

Yes, if it cross basic blocks, sometimes it may cause regressions. Also, as it is a target specific problem, so we think it's better to do such combine in CodeGen.

Thanks,
-Hao

In D6334#7, @hfinkel wrote:

Can you please put this in DAGCombine, and let the target optionally enable it. The target can customize this:

// Skip if there is less than three FDIVs.
// FIXME: Different subtargets may behave differently. This can be
// controlled depending on subtargets.
if (Users.size() < 3)

I think that adding something like:

virtual bool combineRepeatedFPDivisors(unsigned &MinUsers) {
  return false;
}

added in TargetLowering.h right around the existing division functions would work well (I'm not attached to the name, feel free to propose some other name). I'd like to use this for PPC too.

Hi Hale,

That's a good idea.

I've attached a new patch moving the logic into DAGCombiner.

Thanks,
-hao

hfinkel accepted this revision.Nov 20 2014, 7:36 PM
hfinkel added a reviewer: hfinkel.

Please change how the comparison is done as noted below, otherwise LGTM.

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
7115

Don't do the comparison by creating new nodes. You can:

if (auto *CN0 = dyn_cast<ConstantFPSDNode>(N0)) {
  if (CN0->isExactlyValue(1.0))
    return SDValue();
}

Create the FPOne later if you need it.

This revision is now accepted and ready to land.Nov 20 2014, 7:36 PM
In D6334#9, @spatel wrote:

Just chiming in for the target-independent path: x86 needs this too.

I don't understand the expense argument cited here:
http://llvm.org/bugs/show_bug.cgi?id=16218#c4

Is checking uses in InstCombine more expensive than in DAGCombine? This transform is only firing with fast FP-math. Does that make it any more tolerable?

Hi Sanjay,

I think it is a target specific problem. Some target may not want such combine.
E.g. we combine "N FDIVs" into "1 FDIVs and N FMULs". One shortcoming is that there is one more instruction. Also, different target may have different costs for FDIV and FMUL. Even though we can suppose that a FMUL is fater than a FDIV, but maybe "2 FDIV" is faster than "1 FDIV and 2 FMUL".

So I think it's better to be solved in the backend.

Thanks,
-Hao

HaoLiu added a comment.EditedNov 20 2014, 10:53 PM
In D6334#14, @hfinkel wrote:

Please change how the comparison is done as noted below, otherwise LGTM.

Hi Hale,

Nice suggestion.
The patch has been modified and committed in http://llvm.org/viewvc/llvm-project?view=revision&revision=222510.

Thanks,
-Hao

Hi Hao,

Thanks for the patch. I think ARM needs it too.

HaoLiu closed this revision.Dec 21 2014, 10:29 PM