This is an archive of the discontinued LLVM Phabricator instance.

Refactor reciprocal and reciprocal square root estimate into target-independent functions (part 2).
ClosedPublic

Authored by spatel on Sep 24 2014, 12:47 PM.

Details

Summary

This is purely refactoring. No functional changes intended.

The ultimate goal is to allow targets other than PowerPC (certainly X86 and Aarch64) to turn this:

z = y / sqrt(x)

into:

z = y * rsqrte(x)

And:

z = y / x

into:

z = y * rcpe(x)

using whatever HW magic they can use. See http://llvm.org/bugs/show_bug.cgi?id=20900 .

In part 1 ( http://reviews.llvm.org/D5425 ) of this refactoring, I moved just the wrapper portion of the square root estimate out of the PPC backend and into DAGCombiner. In this patch, I've moved everything that I can out of PPCISelLowering and into DAGCombiner.

It turns out that we might as well grab the reciprocal estimate code too because I think that any hardware that provides a rsqrt estimate is also going to provide a recip estimate. And PPC even uses rcpe to generate sqrt from rsqrte! I added a visitFSQRT() to DAGCombiner to keep that functionality.

There are small hooks in TargetLowering to get the target-specific opcode for each estimate instruction and a function to tell DAGCombiner how many times it needs to run the Newton-Raphson refinement loop.

This will allow any target to generate the estimate code by implementing these methods:

virtual SDValue getRecipEst(SDValue Op, DAGCombinerInfo &DCI) const;
virtual SDValue getRSqrtEst(SDValue Op, DAGCombinerInfo &DCI) const;
virtual unsigned getNRSteps(EVT VT) const;

Diff Detail

Event Timeline

spatel updated this revision to Diff 14047.Sep 24 2014, 12:47 PM
spatel retitled this revision from to Refactor reciprocal and reciprocal square root estimate into target-independent functions (part 2)..
spatel updated this object.
spatel edited the test plan for this revision. (Show Details)
spatel added reviewers: chandlerc, hfinkel.
spatel added subscribers: Unknown Object (MLST), tycho.
hfinkel accepted this revision.Sep 25 2014, 1:21 PM
hfinkel edited edge metadata.

LGTM, thanks!

include/llvm/Target/TargetLowering.h
2632

The number of iterations necessary for the reciprocal estimate and for the reciprocal sqrt estimate might be different. Please provide a way to differentiate (and I'd want to make really sure the target actually overrides this). Maybe:

virtual unsigned getNRSteps(EVT VT, bool SqrtEst) const {
  llvm_unreachable("Target must provide the number of iterations");
}
This revision is now accepted and ready to land.Sep 25 2014, 1:21 PM
spatel added inline comments.Sep 26 2014, 9:02 AM
include/llvm/Target/TargetLowering.h
2632

Sure - I'll make unique functions to return iteration counts for sqrte and rcpe.

We may need one more refinement here regarding the rcpe(rsqrt(x)) transformation of a regular sqrt(x)...my guess is that's not a win on any recent X86 (and probably not PPC either?). But that change can come later if needed.

hfinkel added inline comments.Sep 26 2014, 9:39 AM
include/llvm/Target/TargetLowering.h
2632

Regarding PPC, you might be right about some of them -- it is certainly a win on the embedded cores where the sqrt instruction is not fully pipelined. We'll need to do some measurements.

spatel added inline comments.Sep 26 2014, 1:37 PM
include/llvm/Target/TargetLowering.h
2632

It's coming back to me now (used to be at IBM and Apple)...
I think the deciding factor is not whether the sqrt instruction is pipelined, but whether it exists at all. Eg, 7400/7450 had fre/frsqrte, but lacked fsqrt. In that case, the decision is between doing a long sequence of dependent ops using the estimates vs. making a call to libm sqrt(). If fsqrt exists, it should probably be used unless there's some truly horrible HW implementation out there.
Certainly, this should be measured on as many targets as possible to see if it's true.

spatel updated this revision to Diff 14136.Sep 26 2014, 2:09 PM
spatel edited edge metadata.

Instead of making different functions for each estimate possibility, I think it's better to make the getEstimate() interface in TargetLowering as generic as possible and let the targets do whatever they want under that. Eg, Altivec provides estimates for log() and exp(). These could conceivably be used to replace libm calls. GPUs might have instructions for those too.

As part of the minimization, I rolled the refinement steps parameter into the single API. Please let me know if you see a better or more elegant way.

spatel closed this revision.Sep 26 2014, 4:12 PM
spatel updated this revision to Diff 14140.

Closed by commit rL218553 (authored by @spatel).

hfinkel added inline comments.Sep 27 2014, 6:22 PM
include/llvm/Target/TargetLowering.h
2632

Obviously whether it exists at all matters, but the pipelining definitely also matters -- at least on some cores. On the A2 (an embedded core), for example, a full sqrt blocks the issuing thread from issuing any additional floating-point instructions for 69 cycles. There the pipelining definitely matters, but on other cores I'm less certain (which is why I said that we'd need to measure it).