
[X86] Heuristic to selectively build Newton-Raphson SQRT estimation
ClosedPublic

Authored by n.bozhenov on Jun 15 2016, 7:38 AM.

Details

Summary

On modern Intel processors, hardware SQRT is in many cases faster than RSQRT
followed by Newton-Raphson refinement. This patch introduces a simple heuristic
to choose between the hardware SQRT instruction and the Newton-Raphson software
estimation.

The patch treats scalars and vectors differently. The heuristic is that for
scalars the compiler should optimize for latency while for vectors it should
optimize for throughput.

Basically, the patch disables scalar NR for big cores and disables NR completely
for Skylake. First, scalar SQRT has shorter latency than the NR code on big
cores. Second, vector SQRT has been greatly improved in Skylake and has better
throughput than NR.
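For reference, the RSQRT + Newton-Raphson expansion being discussed can be sketched as follows. This is a minimal standalone illustration, not what the compiler emits: the bit-level trick merely stands in for the ~12-bit-accurate hardware RSQRTSS estimate.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Crude bit-level initial estimate of 1/sqrt(x). A portable stand-in
// for the hardware RSQRTSS estimate (NOT what the compiler generates).
float rsqrt_estimate(float x) {
    std::uint32_t i;
    std::memcpy(&i, &x, sizeof(i));
    i = 0x5f3759df - (i >> 1);
    float y;
    std::memcpy(&y, &i, sizeof(y));
    return y;
}

// SQRT(x) computed as x * RSQRT(x), with one Newton-Raphson refinement
// step on the reciprocal square root: y1 = y0 * (1.5 - 0.5 * x * y0 * y0).
float nr_sqrt(float x) {
    float y = rsqrt_estimate(x);
    y = y * (1.5f - 0.5f * x * y * y); // one refinement step
    return x * y;                      // sqrt(x) == x * (1/sqrt(x))
}
```

Even with one refinement step the result is only accurate to a fraction of a percent, which is why the hardware instruction also wins on accuracy; a real lowering additionally has to handle x == 0, where the reciprocal estimate is infinite.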

Diff Detail

Repository
rL LLVM

Event Timeline

n.bozhenov updated this revision to Diff 60832.Jun 15 2016, 7:38 AM
n.bozhenov retitled this revision from to [X86] Heuristic to selectively build Newton-Raphson SQRT estimation.
n.bozhenov updated this object.

Below are some figures to justify the change.

Latency/throughput data from Architecture Optimization Manual:

|      |  IVB |   HSW |   BDW |  SKL |
|------+------+-------+-------+------|
| x32  | 14/7 |  13/7 |  13/4 | 13/3 |
| x128 | 14/7 |  13/7 |  13/7 | 13/3 |
| x256 |      | 19/13 | 19/13 | 12/6 |

Experimental Newton-Raphson efficiency for latency-bound code:

|      |  IVB |  HSW |  BDW |  SKL |
|------+------+------+------+------|
| x32  | -41% | -40% | -21% | -40% |
| x128 | -32% | -32% | -17% | -35% |

Experimental Newton-Raphson efficiency for throughput-bound code:

|      |  IVB |  HSW |  BDW |  SKL |
|------+------+------+------+------|
| x32  | +18% | +21% | -17% | -40% |
| x128 | +10% | +14% | +28% | -50% |
| x256 |      | +68% | +85% |  +3% |

The latency/throughput data are for the SQRT instruction, of course.

spatel edited edge metadata.Jun 15 2016, 1:16 PM

  1. Shouldn't HSW show a latency improvement over IVB from using FMA?
  2. How many N-R steps are included in your measurements?
  3. Do the measurements include the change from D21127?

When we enabled the estimate generation code (https://llvm.org/bugs/show_bug.cgi?id=21385#c32), we knew it had higher latency for SNB/IVB/HSW, but we reasoned that most real-world FP code would care more about throughput. This patch proposes to change that behavior for those targets (i.e., favor latency at the expense of throughput). Do you have any benchmark numbers (test-suite, SPEC, etc.) for those CPUs that show a difference?

For the test file, please add RUNs that include the new attributes themselves rather than specifying a CPU. That way we'll have coverage for the expected behavior independently of any individual CPU.

I have no objection to the AMDGPU changes.

n.bozhenov updated this revision to Diff 65344.Jul 25 2016, 7:35 AM
n.bozhenov edited edge metadata.

An updated version of the patch is uploaded. After more careful benchmarking and
analysis, I found a performance problem in a corner case where both SQRT(x) and
RSQRT(x) are required. In that case, the compiler may build a plain SQRTSS
instruction to calculate SQRT(x) and an RSQRTSS followed by refinement to
calculate RSQRT(x). So, I've added an additional check to
X86TargetLowering::isFsqrtCheap to avoid building both SQRT and RSQRT
instructions for the same input value.
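The decision being described can be modeled roughly like this. This is an illustrative sketch only, not the actual LLVM code: the struct fields and function signature are hypothetical stand-ins for the real subtarget features and SelectionDAG query.

```cpp
// Hypothetical model of the heuristic (not the real LLVM interfaces).
struct SubtargetModel {
    bool fastScalarFSQRT; // big cores: scalar SQRTSS beats NR on latency
    bool fastVectorFSQRT; // Skylake: vector SQRTPS beats NR on throughput
};

// Return true when the hardware SQRT instruction should be preferred
// over the RSQRT + Newton-Raphson expansion.
bool isFsqrtCheap(const SubtargetModel &ST, bool isVector,
                  bool rsqrtOfSameValueNeeded) {
    // If RSQRT(x) is needed anyway, derive SQRT(x) from the same
    // refined estimate instead of emitting both SQRTSS and RSQRTSS.
    if (rsqrtOfSameValueNeeded)
        return false;
    return isVector ? ST.fastVectorFSQRT : ST.fastScalarFSQRT;
}
```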

Great questions, Sanjay!

Shouldn't HSW show a latency improvement over IVB from using FMA?

FMA doesn't affect the results much. In most cases the difference between FMA
code and non-FMA code is not crucial. The only case significantly affected by
FMA is scalar SQRT on Haswell, where NR got 15% higher throughput with FMA.
Here is the table for throughput-bound FMA code:

|      |  HSW |  BDW |  SKL |
|------+------+------+------|
| x32  | +38% | -12% | -26% |
| x128 | +12% | +32% | -30% |
| x256 | +69% | +84% |  +6% |

And the updated table for latency-bound FMA code:

|      |  HSW |  BDW |  SKL |
|------+------+------+------|
| x32  | -32% | -20% | -25% |
| x128 | -34% | -28% | -25% |
| x256 | -21% |  +6% |  -2% |

How many N-R steps are included in your measurements?

I benchmarked the default number of steps, which is one refinement step.

Do the measurements include the change from D21127?

Yes.

When we enabled the estimate generation code, we knew it had higher latency for SNB/IVB/HSW, but we reasoned that most real-world FP code would care more about throughput.

The idea behind this patch is that throughput-bound code is likely to be
vectorized, so for vectorized code we should care about throughput. But if the
code is scalar, that probably means the code has some kind of dependency, and we
should care more about reducing latency.
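As a toy illustration of the two regimes (hypothetical code, not from the patch):

```cpp
#include <cmath>

// Latency-bound: each iteration depends on the previous result, so the
// sqrt latency sits on the critical path and cannot be hidden.
float latency_chain(float x, int n) {
    for (int i = 0; i < n; ++i)
        x = std::sqrt(x + 1.0f); // loop-carried dependence
    return x;
}

// Throughput-bound: iterations are independent, so the loop can be
// vectorized and several sqrt operations can be in flight at once.
void throughput_loop(const float *in, float *out, int n) {
    for (int i = 0; i < n; ++i)
        out[i] = std::sqrt(in[i]);
}
```

For the first function only latency matters; for the second, a vectorizer can use SQRTPS and the instruction's throughput dominates.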

You mentioned PR21385, but that PR mostly deals with reciprocal square roots,
which are NOT affected by this patch. SQRTSS+DIVSS is way too heavy a
combination even for Skylake. So, this patch is only about using SQRTSS/SQRTPS
instructions to calculate non-reciprocal roots.

Do you have any benchmark numbers (test-suite, SPEC, etc) for those CPUs that shows a difference?

We can see large performance improvements on some of our internal benchmarks,
and they were our motivating examples. As for SPEC, the updated version of the
patch doesn't affect any hot code in SPEC2000 or SPEC2006. Nor does it affect
hot code in the test-suite/Bullet benchmark (which is a very sqrt-intensive one).

Also, this patch makes a difference not only to performance. It also improves
accuracy in the affected cases. And generally we should prefer more precise
code, even with fast-math, unless we expect a significant performance improvement.

For the test file, please add RUNs that include the new attributes themselves rather than specifying a CPU.

Done. I've added a few more RUNs to the test to check the attributes themselves.

hfinkel edited edge metadata.Jul 25 2016, 12:57 PM


I'd obviously also like to think that any throughput-sensitive code is vectorized (or at least that the vectorizer has unrolled it to hide latency if helpful). In general, this low-latency/low-throughput vs. high-latency/high-throughput choice is exactly the kind of situation that the MachineCombiner is supposed to handle. It might be difficult to use here, however, because we want to insert NR before instruction selection (to take advantage of DAGCombine simplifications and complex ISel patterns), and replacing the estimate/NR sequence in the MachineCombiner might be tricky (especially since the user can select the number of iterations, and matching only the simple case of one iteration would be undesirable). As a result, unfortunately, phase-ordering constraints might prevent using the principled solution here. The heuristic here does not seem unreasonable; I just wish we did not have to use it.

In general, however, the rationale needs to be much better documented in the code. I don't see any comments in the patch itself about the latency/throughput tradeoff.

n.bozhenov updated this revision to Diff 65527.Jul 26 2016, 8:33 AM
n.bozhenov edited edge metadata.

In general, however, the rationale needs to be much better documented in the code. I don't see any comments in the patch itself about the latency/throughput tradeoff.

Hal, you are right. The patch lacked comments explaining the tradeoff.
The new version of the patch adds the missing comment to the X86.td file.

Hal's comment about the MachineCombiner reminded me of the excellent list of FMA loop examples provided by @v_klochkov in:
https://reviews.llvm.org/D18751.

Summary: the latency vs. throughput trade-off is tricky even if we defer the decision to a later pass.

LGTM, but Hal can confirm if the added comments are sufficient.

hfinkel accepted this revision.Aug 2 2016, 9:31 AM
hfinkel edited edge metadata.


Yes, thanks!

This revision is now accepted and ready to land.Aug 2 2016, 9:31 AM
This revision was automatically updated to reflect the committed changes.