This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Optionally use the Newton series for reciprocal estimation
ClosedPublic

Authored by evandro on Oct 5 2016, 12:53 PM.

Details

Summary

This patch adds support for estimating the square root or its reciprocal and division or reciprocal using the combiner generic Newton series.

Diff Detail

Repository
rL LLVM

Event Timeline

evandro updated this revision to Diff 73683.Oct 5 2016, 12:53 PM
evandro retitled this revision from to [AArch64] Optionally use the reciprocal estimation machinery.
evandro updated this object.
evandro added reviewers: spatel, hfinkel.
evandro set the repository for this revision to rL LLVM.
evandro added subscribers: llvm-commits, n.bozhenov, echristo and 2 others.
spatel edited edge metadata.Oct 7 2016, 7:48 AM

This patch mostly follows the existing pattern used by PPC and x86, so I have no objections. But I know there has been some controversy about the use of a CPU attribute as the enabling device. Someone from the AArch64 camp should comment on that. I don't know enough about the various CPU implementations to say whether there's a better way.

Note that in x86, the recent Intel FPUs are so fast that we have the opposite CPU attribute "FeatureFastScalarFSQRT" to turn *off* reciprocal codegen via the target hook isFsqrtCheap(). This may also be controversial (shouldn't these CPU-model-specific-transforms happen at the machine instruction level?), but there is a substantial precedent for fast/slow attributes used in the DAG as heuristics for isel.

llvm/lib/Target/AArch64/AArch64.td
109–111

reverse -> reciprocal ?

jmolloy requested changes to this revision.Oct 7 2016, 7:54 AM
jmolloy added a reviewer: jmolloy.

Hi,

Yes, I've said multiple times that I'm opposed to enabling this by feature and I stand by that. If someone can show a good reason for it, fair enough but I haven't seen a good reasoning (for AArch64/ARM) so far.

If you want to just enable reciprocal selection and test it, then a cl::opt flag seems most appropriate because that's how we enable experimental stuff broad-brush for testing. A CPU feature really isn't right as it ignores the important context that should go into deciding whether to use these instructions (on ARM/AArch64).

Alternatively there may exist a target with such a slow SQRT unit that RSQRTE/RSQRTS is always better regardless of context, but I haven't seen any evidence for that either.

Cheers,

James

This revision now requires changes to proceed.Oct 7 2016, 7:54 AM
rengolin added a subscriber: jojo.Oct 7 2016, 7:58 AM
evandro updated this revision to Diff 73935.Oct 7 2016, 8:14 AM
evandro edited edge metadata.
evandro marked an inline comment as done.

s/reverse/reciprocal/

If you want to just enable reciprocal selection and test it, then a cl::opt flag seems most appropriate because that's how we enable experimental stuff broad-brush for testing. A CPU feature really isn't right as it ignores the important context that should go into deciding whether to use these instructions (on ARM/AArch64).

Adding an option is a good idea to provide a means for users to tap into this feature.

Alternatively there may exist a target with such a slow SQRT unit that RSQRTE/RSQRTS is always better regardless of context, but I haven't seen any evidence for that either.

The M1 is it. Indeed, not always, but most of the time.

jmolloy accepted this revision.Oct 7 2016, 8:20 AM
jmolloy edited edge metadata.

Hi,

The M1 is it. Indeed, not always, but most of the time.

Well I can't argue with that. You know the microarchitecture! :)

This revision is now accepted and ready to land.Oct 7 2016, 8:20 AM
evandro updated this revision to Diff 73936.Oct 7 2016, 8:22 AM
evandro edited edge metadata.
rengolin requested changes to this revision.Oct 7 2016, 8:23 AM
rengolin added a reviewer: rengolin.

Wait, Eric said there was an LTO problem with selecting it per sub-arch. I'd rather him review it first.

This revision now requires changes to proceed.Oct 7 2016, 8:23 AM

Isn't this the same patch Eric reverted a few weeks ago?

If you want to just enable reciprocal selection and test it, then a cl::opt flag seems most appropriate because that's how we enable experimental stuff broad-brush for testing. A CPU feature really isn't right as it ignores the important context that should go into deciding whether to use these instructions (on ARM/AArch64).

Adding an option is a good idea to provide a means for users to tap into this feature.

Of course, one can always use -mattr=+use-reciprocal-square-root.

Thank you.

If you want to just enable reciprocal selection and test it, then a cl::opt flag seems most appropriate because that's how we enable experimental stuff broad-brush for testing. A CPU feature really isn't right as it ignores the important context that should go into deciding whether to use these instructions (on ARM/AArch64).

Adding an option is a good idea to provide a means for users to tap into this feature.

Of course, one can always use -mattr=+use-reciprocal-square-root.

Thank you.

Surely not, because the attribute should merely indicate that use of a reciprocal is *allowed*, not that it is worthwhile.

Wait, Eric said there was an LTO problem with selecting it per sub-arch. I'd rather him review it first.

This is a modification of the original patch to use the function attribute.

If you want to just enable reciprocal selection and test it, then a cl::opt flag seems most appropriate because that's how we enable experimental stuff broad-brush for testing. A CPU feature really isn't right as it ignores the important context that should go into deciding whether to use these instructions (on ARM/AArch64).

Adding an option is a good idea to provide a means for users to tap into this feature.

Of course, one can always use -mattr=+use-reciprocal-square-root.

Surely not, because the attribute should merely indicate that use of a reciprocal is *allowed*, not that it is worthwhile.

Enabling the attribute has the effect of using the reciprocal for sqrt(), allowing the user to explore if it's worthwhile in his specific case.

Wait, Eric said there was an LTO problem with selecting it per sub-arch. I'd rather him review it first.

This is a modification of the original patch to use the function attribute.

Right, ok. Let's wait for Eric's review to make sure it solves the problem he was seeing.

Still doesn't solve the problem of choosing it on other AArch64 cores upon analysis, nor offers a good way to test it (as James said).

Maybe we should do as Jojo said earlier and leave it as a hidden flag, disabled by default, so we can test on everyone's side before this goes live, even for M1.

cheers,
--renato

Maybe we should do as Jojo said earlier and leave it as a hidden flag, disabled by default, so we can test on everyone's side before this goes live, even for M1.

Everyone's testing should not gate M1.

Wait, Eric said there was an LTO problem with selecting it per sub-arch. I'd rather him review it first.

This is a modification of the original patch to use the function attribute.

Right, ok. Let's wait for Eric's review to make sure it solves the problem he was seeing.

As @spatel said, this patch mostly follows the changes done for PPC.

@jmolloy mentioned the surrounding context for deciding when to use the estimate instructions. I don't think anyone would argue that using an isel attribute to make the decision is anything more than a heuristic.

The alternative is to wait and/or fixup the isel decision in MachineCombiner or some other machine pass. But I think it's worth copying this comment from D18751 / @v_klochkov again - this is at least the 3rd time I've done this. :)

The comment is about FMA, and the examples use x86 cores, but the problem for the compiler is the same: choosing the optimal instructions is a hard problem, and it may not be possible to make this decision without some kind of heuristic.

Here I just wanted to add some notes regarding Latency-vs-Throughput problem in X86
to let other developers have them in view/attention when they add latency-vs-throughput fixes.

My biggest concern regarding making Latency-vs-Throughput decisions is that
such decisions are often made using just one pattern or DAG, it is not based on the whole loop
analysis (perhaps I am missing something in LLVM).

I provided 4 examples having quite similar code in them.

Example1 - shows that FMAs can be very harmful for performance on Haswell.
Example2 - is similar to Example1, shows that FMAs can be harmful on Haswell and newer CPUs like Skylake.
           It also shows that it is often enough to replace only 1 FMA to fix the problem and leave other FMAs.
Example3 - shows that solutions for Example1 and Example2 can easily be wrong.
Example4 - shows that there is no ONE solution like "tune for throughput" or "tune for latency"
           exists, and tuning may be different for different DAGs in one loop.

Ok, let's start...

Fusing MUL+ADD into FMA may easily be inefficient at Out-Of-Order CPUs.
The following trivial loop works about 60-70% slower on Haswell(-march=core-avx2) if FMA is generated.

Example1:
!NOTE: Please assume that the C code below only represents the structure of the final ASM code
(i.e. the loop is not unrolled, etc.)

  // LOOP1
  for (unsigned i = 0; i < N; i++) {
    accu = a[i] * b + accu;// ACCU = FMA(a[i],b,ACCU)
  }
  with FMAs: The latency of the whole loop on Haswell is N*Latency(FMA) = N*5  
  without FMAs: The latency of the whole loop on Haswell is N*Latency(ADD) = N*3
              MUL operation adds nothing as it is computed out-of-order,
			  i.e. the result of MUL is always available when it is ready to be consumed by ADD.

Having FMAs for such loop may result in (N*5)/(N*3) = (5/3) = 1.67x slowdown
comparing to the code without FMAs.

On SkyLake(CPUs with AVX512) both version of LOOP1 (with and without FMA) would
work the same time because the latency of ADD is equal to latency of FMA there.

Example2:
The same problem still can be easily reproduced on SkyLake even though the
latencies of MUL/ADD/FMA are all equal there:

// LOOP2
for (unsigned i = 0; i < N; i++) {
  accu = a[i] * b + c[i] * d + accu;
}

There may be at least 3 different sequences for the LOOP2:
S1: 2xFMAs: ACCU = FMA(a[i],b,FMA(c[i],d,ACCU); LATENCY = 2xLAT(FMA) = 2*4
S2: 0xFMAs: ACCU = ADD(ADD(MUL(a[i],b),MUL(c[i],d)),ACCU)
LATENCY = 2xLAT(ADD) = 2*4
S3: 1xFMA: ACCU = ADD(ACCU, FMA(a[i],b,MUL(c[i]*d))) // LATENCY = 1xLAT(ADD) = 4

In (S3) the MUL and FMA operations do not add anything to the latency of the whole expression
because Out-Of-Order CPU has enough execution units to prepare the results of MUL and FMA
before they are ready to be consumed by ADD.
So (S3) would be about 2 times faster on SkyLake and up to 3.3 times faster on Haswell.

Example3:
It shows that the heuristics that could be implemented for Example1 and Example2
may be wrong if applied without the whole loop analysis.

// LOOP3
for (unsigned i = 0; i < N; i++) {
  accu1 = a1[i] * b + c1[i] * d + accu1;
  accu2 = a2[i] * b + c2[i] * d + accu2;
  accu3 = a3[i] * b + c3[i] * d + accu3;
  accu4 = a4[i] * b + c4[i] * d + accu4;
  accu5 = a5[i] * b + c5[i] * d + accu5;
  accu6 = a6[i] * b + c6[i] * d + accu6;
  accu7 = a7[i] * b + c7[i] * d + accu7;
  accu8 = a8[i] * b + c8[i] * d + accu8;
}

This loop must be tuned for throughput because there are many independent DAGs
putting high pressure on the CPU execution units.
The sequence (S1) from example2 is the best solution for all accumulators in LOOP3:
"ACCUi = FMA(ai[i] * b, FMA(ci[i] * d, ACCUi)".
It works faster because the loop is bounded by throughput.

On SkyLake:

T = approximate throughput of the loop counted in clock-ticks = 
  N * 16 operations / 2 execution units = N*8
L = latency of the loop = 
  N * 2*Lat(FMA) = N*2*4 = N*8

The time spent in such loop is MAX(L,T) = MAX(N*8, N*8).

The attempts to replace FMAs with MUL and ADD may reduce (L), but will increase (T),
the time spent in the loop is MAX(L,T) will only be bigger.

Example4:
There may be mixed tuning, i.e. for both throughput and latency in one loop:

// LOOP4
for (unsigned i = 0; i < N; i++) {
  accu1 = a1[i] * b + c1[i] * d + accu1; // tune for latency
  accu2 = a2[i] * b + accu2; // tune for throughput
  accu3 = a3[i] * b + accu3; // tune for throughput
  accu4 = a4[i] * b + accu4; // tune for throughput
  accu5 = a5[i] * b + accu5; // tune for throughput
  accu6 = a6[i] * b + accu6; // tune for throughput
}

On Haswell:
If generate 2 FMAs for ACCU1 and 1 FMA for each of ACCU2,..6, then

Latency of the loop is L = N*2*Latency(FMA) = N*2*5   
Throughput T = N * 7 / 2
MAX (L,T) = N*10

Using 1xMUL+1xFMA+1xADD for ACCU1 will reduce the latency L from N*2*5 to

L = N*Latency(FMA) = N*5,
and will only slightly increase T from N*3.5 to 
T = N * 8 operations / 2 execution units = N*4

As a result using sequence (S3) will reduce MAX(L,T) from N*10 to MAX(N*5,N*4) = N*5.

Splitting FMAs in ACCU2,..6 will only increase MAX(L,T).

L = N*Latency(ADD) = N*3
T = N * 13 operations / 2 = N*6.5
MAX(L,T) = MAX(N*3, N*6.5) = N*6.5.

So, the best solution in example4 is to split 1 FMA in ACCU1, but keep all other FMAs.
`

Everyone's testing should not gate M1.

No, but this patch has been reverted before, and I'd rather wait for Eric's comment before pushing this again.

Making it a hidden disabled option would give you the ability to enable downstream and us the ability to test on other cores / situations, and wouldn't break Eric's tests.

cheers,
--renato

evandro added inline comments.Oct 7 2016, 1:14 PM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
644

I guess that this should be qualified with STI.hasNEON()...

4618

... and this check removed.

4625

I guess that target information should not be present here anymore. But do f16 types make sense when, though supported by the target, TargetRecip does not support them?

evandro added inline comments.Oct 7 2016, 3:24 PM
llvm/test/CodeGen/AArch64/sqrt-fastmath.ll
2

s/reverse/reciprocal/

evandro marked 4 inline comments as done.Oct 17 2016, 8:58 AM
evandro updated this revision to Diff 74871.Oct 17 2016, 12:29 PM
evandro edited edge metadata.

No, but this patch has been reverted before, and I'd rather wait for Eric's comment before pushing this again.

@echristo, @rengolin insists on your input.

Note that I recommitted D25440 at rL284746 which will affect this patch - should make it smaller. :)
I think that commit will stick this time; the bots that failed with the earlier version appear happy now.

Note that I recommitted D25440 at rL284746 which will affect this patch - should make it smaller. :)

Thank you for the heads up.

evandro updated this revision to Diff 75435.Oct 21 2016, 8:25 AM
evandro retitled this revision from [AArch64] Optionally use the reciprocal estimation machinery to [AArch64] Optionally use the Newton series for reciprocal estimation.
evandro edited edge metadata.
echristo accepted this revision.Oct 21 2016, 1:42 PM
echristo edited edge metadata.

LGTM.

Thanks!

-eric

This revision was automatically updated to reflect the committed changes.