Page MenuHomePhabricator

Use rsqrt (X86) to speed up reciprocal square root calcs (PR20900)
ClosedPublic

Authored by spatel on Oct 7 2014, 5:52 PM.

Details

Summary

This is a first step for generating SSE rsqrt instructions for reciprocal square root calcs when fast-math is allowed.

For now, be conservative and only enable this for AMD btver2 where performance improves significantly - for example, 29% on llvm/projects/test-suite/SingleSource/Benchmarks/BenchmarkGame/n-body.c if we convert the data type to single-precision float.

We will probably never enable this codegen for any Intel Core* chips because the sqrt/divider circuits are just too fast. On SandyBridge, sqrtss + divss can be as fast as 20 cycles which is better than the 23 cycle critical path for the rsqrt + mul + mul + add + mul estimate.

Follow-on patches may allow reciprocal (rcpss) optimizations, add more vector data types, and enable the optimization for more chips.

More background here: http://llvm.org/bugs/show_bug.cgi?id=20900

Diff Detail

Repository
rL LLVM

Event Timeline

spatel updated this revision to Diff 14534.Oct 7 2014, 5:52 PM
spatel retitled this revision from to Use rsqrt (X86) to speed up reciprocal square root calcs (PR20900).
spatel updated this object.
spatel edited the test plan for this revision. (Show Details)
spatel added reviewers: hfinkel, nadav.
spatel added subscribers: Unknown Object (MLST), RKSimon, tycho.
hfinkel edited edge metadata.Oct 9 2014, 12:42 PM

We will probably never enable this codegen for any Intel Core* chips because the sqrt/divider circuits are just too fast. On SandyBridge, sqrtss + divss can be as fast as 20 cycles which is better than the 23 cycle critical path for the rsqrt + mul + mul + add + mul estimate.

Critical path latency is good, but throughput is normally much better. According to Intel's optimization manual, rsqrtss, for example, is fully pipelined on most Intel cores (on Westmere and Nehalem the dispatch delay is 3 cycles, but 1 cycle elsewhere). But the dispatch delay time for sqrtss is 7 cycles on Haswell, 7-14 cycles on Sandy Bridge, something under 16 cycles for Westmere and Nehalem, and 11 cycles for Silvermont. The throughput for divss is a little better than sqrtss, but not by much.

In short, this is likely a big win *if* there is anything else going on (floating-point-wise), even on Intel cores. I could be wrong, but this very-much reminds me of the problem that the MachineCombiner pass tries to solve for FMAs, etc. on some targets, and I wonder if it could somehow be applied to this as well.

lib/Target/X86/X86ISelLowering.cpp
14341 ↗(On Diff #14534)

I'd really prefer that you put the 2-constant version of the algorithm into the DAGCombiner along side the 1-constant version, and just let the target pick. The algorithm itself is really a mathematical expression, and not at all really target dependent, and we should try to keep such things available to other targets without copy-and-paste.

Ideally, we'd then also have a flag to force one or the other, so that way PPC can default to the 1-constant version, X86 can default to the 2-constant version, but there's a command-line option I can use to force the choice for benchmarking.

spatel added a comment.Oct 9 2014, 2:50 PM
In D5658#4, @hfinkel wrote:

In short, this is likely a big win *if* there is anything else going on (floating-point-wise), even on Intel cores. I could be wrong, but this very-much reminds me of the problem that the MachineCombiner pass tries to solve for FMAs, etc. on some targets, and I wonder if it could somehow be applied to this as well.

Yes, I agree with the throughput argument. And if the user can flip just this one bit of codegen with an attribute flag, that would be ideal.

I didn't want to overstep with this patch though, so I thought it'd be best to just start with the core where I know it's always a win to get rid of the sqrtss/divss.

@tycho may be able to better assess the perf differences on Intel. Some informal benchmarking using an n-body program certainly does show big wins using the rsqrt code on Haswell.

lib/Target/X86/X86ISelLowering.cpp
14341 ↗(On Diff #14534)

Agree - I'll rework this.

spatel updated this revision to Diff 14694.Oct 9 2014, 6:38 PM
spatel edited edge metadata.

This version of the patch brings the 2-constant NR builder into DAGCombiner and adds a target-specified boolean to select whether we should use the 2-constant version or the 1-constant version.

We can add a command-line override for the NR selector in a subsequent patch.

hfinkel added inline comments.Oct 9 2014, 7:05 PM
lib/Target/X86/X86ISelLowering.cpp
14347 ↗(On Diff #14694)

Please write out "significant digits"

14355 ↗(On Diff #14694)

Why wouldn't it be?

spatel added inline comments.Oct 9 2014, 8:11 PM
lib/Target/X86/X86ISelLowering.cpp
14355 ↗(On Diff #14694)

A double-precision rsqrt estimate with refinement on x86 prior to FMA requires at least 16 instructions: convert to single, rsqrtss, convert back to double, refine (3 steps = at least 13 insts). I don't think Intel/AMD ever intended for that, or they would've added 'rsqrtsd' (similar to PPC's double-precision frsqrte). AFAICT, no x86 compiler tries to generate that sequence. Now that FMA has been introduced, it might be more feasible, but the HW implementations that have FMA also have really fast sqrt/div units, so it's again not worth it. Add this background to the code comment?

  • Original Message -----

From: "Sanjay Patel" <spatel@rotateright.com>
To: spatel@rotateright.com, nrotem@apple.com, hfinkel@anl.gov
Cc: steven@uplinklabs.net, llvm-dev@redking.me.uk, llvm-commits@cs.uiuc.edu
Sent: Thursday, October 9, 2014 10:11:56 PM
Subject: Re: [PATCH] Use rsqrt (X86) to speed up reciprocal square root calcs (PR20900)

Comment at: lib/Target/X86/X86ISelLowering.cpp:14355
@@ +14354,3 @@
+ TODO: Add support for AVX (v8f32) and AVX512 (v16f32).
+
TODO: Is it ever worthwhile to use an estimate for f64?
+ if (Subtarget->hasSSE1() && (VT == MVT::f32 || VT == MVT::v4f32))

{

hfinkel wrote:

Why wouldn't it be?

A double-precision rsqrt estimate with refinement on x86 prior to FMA
requires at least 16 instructions: convert to single, rsqrtss,
convert back to double, refine (3 steps = at least 13 insts). I
don't think Intel/AMD ever intended for that, or they would've added
'rsqrtsd' (similar to PPC's double-precision frsqrte). AFAICT, no
x86 compiler tries to generate that sequence. Now that FMA has been
introduced, it might be more feasible, but the HW implementations
that have FMA also have really fast sqrt/div units, so it's again
not worth it. Add this background to the code comment?

Yes, please. But in light of that, I'd probably not make it a "TODO", just say, "It is likely not profitable to do this for f64 because...".

-Hal

http://reviews.llvm.org/D5658

spatel updated this revision to Diff 14725.Oct 10 2014, 6:58 AM

Updated comments - thank you for the feedback!

Any other suggestions/improvements?

Using this n-body benchmark program:
https://github.com/tycho/nbody

...on a btver2 system, I see excellent performance improvements.

Before:

Running simulation with 16384 particles, crosscheck enabled, CPU enabled, 1 threads
CPU_SOA:            2.10 GFLOPS
CPU_SOA_tiled:   1.12 GFLOPS
CPU_AOS:            0.64 GFLOPS
CPU_AOS_tiled:   1.04 GFLOPS

After:

Running simulation with 16384 particles, crosscheck enabled, CPU enabled, 1 threads
CPU_SOA:           5.19 GFLOPS
CPU_SOA_tiled:  5.34 GFLOPS
CPU_AOS:           1.27 GFLOPS
CPU_AOS_tiled:  1.59 GFLOPS

Ping.

This one should be less controversial than http://reviews.llvm.org/D5787 / http://llvm.org/bugs/show_bug.cgi?id=21290 .
We're not actually adding any more fast-math complexity with this patch - just relying on the existing logic.

andreadb accepted this revision.Oct 23 2014, 7:59 AM
andreadb added a reviewer: andreadb.
andreadb added a subscriber: andreadb.

Hi Sanjay,

The changes to the dag combiner and all the x86 specific changes look good to me.
FWIW, the rest of the patch looks good to me too. But you might want to wait to see what Hal thinks.

This revision is now accepted and ready to land.Oct 23 2014, 7:59 AM
hfinkel accepted this revision.Oct 23 2014, 10:53 AM
hfinkel edited edge metadata.

LGTM too.

spatel closed this revision.Oct 24 2014, 10:12 AM
spatel updated this revision to Diff 15419.

Closed by commit rL220570 (authored by @spatel).

Thanks, Andrea and Hal. Checked in with r220570.