This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Optionally use the Newton series for reciprocal estimation
ClosedPublic

Authored by evandro on Oct 5 2016, 12:53 PM.

Details

Summary

This patch adds support for estimating the square root, the reciprocal square root, and the reciprocal (for division) using the DAG combiner's generic Newton series machinery.

Diff Detail

Repository
rL LLVM

Event Timeline

evandro updated this revision to Diff 73683.Oct 5 2016, 12:53 PM
evandro retitled this revision from to [AArch64] Optionally use the reciprocal estimation machinery.
evandro updated this object.
evandro added reviewers: spatel, hfinkel.
evandro set the repository for this revision to rL LLVM.
evandro added subscribers: llvm-commits, n.bozhenov, echristo and 2 others.
spatel edited edge metadata.Oct 7 2016, 7:48 AM

This patch mostly follows the existing pattern used by PPC and x86, so I have no objections. But I know there has been some controversy about the use of a CPU attribute as the enabling device. Someone from the AArch64 camp should comment on that. I don't know enough about the various CPU implementations to say whether there's a better way.

Note that in x86, the recent Intel FPUs are so fast that we have the opposite CPU attribute "FeatureFastScalarFSQRT" to turn *off* reciprocal codegen via the target hook isFsqrtCheap(). This may also be controversial (shouldn't these CPU-model-specific-transforms happen at the machine instruction level?), but there is a substantial precedent for fast/slow attributes used in the DAG as heuristics for isel.

llvm/lib/Target/AArch64/AArch64.td
109–111

reverse -> reciprocal ?

jmolloy requested changes to this revision.Oct 7 2016, 7:54 AM
jmolloy added a reviewer: jmolloy.

Hi,

Yes, I've said multiple times that I'm opposed to enabling this by feature, and I stand by that. If someone can show a good reason for it, fair enough, but I haven't seen good reasoning (for AArch64/ARM) so far.

If you want to just enable reciprocal selection and test it, then a cl::opt flag seems most appropriate because that's how we enable experimental stuff broad-brush for testing. A CPU feature really isn't right as it ignores the important context that should go into deciding whether to use these instructions (on ARM/AArch64).

Alternatively there may exist a target with such a slow SQRT unit that RSQRTE/RSQRTS is always better regardless of context, but I haven't seen any evidence for that either.

Cheers,

James

This revision now requires changes to proceed.Oct 7 2016, 7:54 AM
rengolin added a subscriber: jojo.Oct 7 2016, 7:58 AM
evandro updated this revision to Diff 73935.Oct 7 2016, 8:14 AM
evandro edited edge metadata.
evandro marked an inline comment as done.

s/reverse/reciprocal/

If you want to just enable reciprocal selection and test it, then a cl::opt flag seems most appropriate because that's how we enable experimental stuff broad-brush for testing. A CPU feature really isn't right as it ignores the important context that should go into deciding whether to use these instructions (on ARM/AArch64).

Adding an option is a good idea to provide a means for users to tap into this feature.

Alternatively there may exist a target with such a slow SQRT unit that RSQRTE/RSQRTS is always better regardless of context, but I haven't seen any evidence for that either.

The M1 is it. Indeed, not always, but most of the time.

jmolloy accepted this revision.Oct 7 2016, 8:20 AM
jmolloy edited edge metadata.

Hi,

The M1 is it. Indeed, not always, but most of the time.

Well I can't argue with that. You know the microarchitecture! :)

This revision is now accepted and ready to land.Oct 7 2016, 8:20 AM
evandro updated this revision to Diff 73936.Oct 7 2016, 8:22 AM
evandro edited edge metadata.
rengolin requested changes to this revision.Oct 7 2016, 8:23 AM
rengolin added a reviewer: rengolin.

Wait, Eric said there was an LTO problem with selecting it per sub-arch. I'd rather him review it first.

This revision now requires changes to proceed.Oct 7 2016, 8:23 AM

Isn't this the same patch Eric reverted a few weeks ago?

If you want to just enable reciprocal selection and test it, then a cl::opt flag seems most appropriate because that's how we enable experimental stuff broad-brush for testing. A CPU feature really isn't right as it ignores the important context that should go into deciding whether to use these instructions (on ARM/AArch64).

Adding an option is a good idea to provide a means for users to tap into this feature.

Of course, one can always use -mattr=+use-reciprocal-square-root.

Thank you.

If you want to just enable reciprocal selection and test it, then a cl::opt flag seems most appropriate because that's how we enable experimental stuff broad-brush for testing. A CPU feature really isn't right as it ignores the important context that should go into deciding whether to use these instructions (on ARM/AArch64).

Adding an option is a good idea to provide a means for users to tap into this feature.

Of course, one can always use -mattr=+use-reciprocal-square-root.

Thank you.

Surely not, because the attribute should merely indicate that use of a reciprocal is *allowed*, not that it is worthwhile.

Wait, Eric said there was an LTO problem with selecting it per sub-arch. I'd rather him review it first.

This is a modification of the original patch to use the function attribute.

If you want to just enable reciprocal selection and test it, then a cl::opt flag seems most appropriate because that's how we enable experimental stuff broad-brush for testing. A CPU feature really isn't right as it ignores the important context that should go into deciding whether to use these instructions (on ARM/AArch64).

Adding an option is a good idea to provide a means for users to tap into this feature.

Of course, one can always use -mattr=+use-reciprocal-square-root.

Surely not, because the attribute should merely indicate that use of a reciprocal is *allowed*, not that it is worthwhile.

Enabling the attribute has the effect of using the reciprocal for sqrt(), allowing users to explore whether it's worthwhile in their specific case.

Wait, Eric said there was an LTO problem with selecting it per sub-arch. I'd rather him review it first.

This is a modification of the original patch to use the function attribute.

Right, ok. Let's wait for Eric's review to make sure it solves the problem he was seeing.

It still doesn't solve the problem of choosing this on other AArch64 cores based on analysis, nor does it offer a good way to test it (as James said).

Maybe we should do as Jojo said earlier and leave it as a hidden flag, disabled by default, so we can test on everyone's side before this goes live, even for M1.

cheers,
--renato

Maybe we should do as Jojo said earlier and leave it as a hidden flag, disabled by default, so we can test on everyone's side before this goes live, even for M1.

Everyone's testing should not gate M1.

Wait, Eric said there was an LTO problem with selecting it per sub-arch. I'd rather him review it first.

This is a modification of the original patch to use the function attribute.

Right, ok. Let's wait for Eric's review to make sure it solves the problem he was seeing.

As @spatel said, this patch mostly follows the changes done for PPC.

@jmolloy mentioned the surrounding context for deciding when to use the estimate instructions. I don't think anyone would argue that using an isel attribute to make the decision is anything more than a heuristic.

The alternative is to wait for and/or fix up the isel decision in MachineCombiner or some other machine pass. But I think it's worth copying this comment from D18751 / @v_klochkov again - this is at least the 3rd time I've done this. :)

The comment is about FMA, and the examples use x86 cores, but the problem for the compiler is the same: choosing the optimal instructions is a hard problem, and it may not be possible to make this decision without some kind of heuristic.

Here I just want to add some notes on the latency-vs-throughput problem on x86,
so that other developers keep them in view when they add latency-vs-throughput fixes.

My biggest concern about latency-vs-throughput decisions is that they are often
made from just one pattern or DAG, not from analysis of the whole loop
(perhaps I am missing something in LLVM).

Below are 4 examples containing quite similar code.

Example1 - shows that FMAs can be very harmful for performance on Haswell.
Example2 - similar to Example1; shows that FMAs can be harmful on Haswell and on newer CPUs like Skylake.
           It also shows that it is often enough to replace only 1 FMA to fix the problem and leave the other FMAs alone.
Example3 - shows that the solutions for Example1 and Example2 can easily be wrong.
Example4 - shows that no single solution such as "tune for throughput" or "tune for latency" exists;
           the tuning may differ between DAGs in one loop.

Ok, let's start...

Fusing MUL+ADD into FMA can easily be inefficient on out-of-order CPUs.
The following trivial loop runs about 60-70% slower on Haswell (-march=core-avx2) if an FMA is generated.

Example1:
!NOTE: Please assume that the C code below only represents the structure of the final ASM code
(i.e. the loop is not unrolled, etc.)

  // LOOP1
  for (unsigned i = 0; i < N; i++) {
    accu = a[i] * b + accu;// ACCU = FMA(a[i],b,ACCU)
  }
  With FMAs: the latency of the whole loop on Haswell is N*Latency(FMA) = N*5.
  Without FMAs: the latency of the whole loop on Haswell is N*Latency(ADD) = N*3.
               The MUL adds nothing because it is computed out of order,
               i.e. its result is already available when the ADD is ready to consume it.

Having FMAs in such a loop may result in a (N*5)/(N*3) = 5/3 = 1.67x slowdown
compared to the code without FMAs.

On Skylake (CPUs with AVX512), both versions of LOOP1 (with and without FMA)
take the same time because the latency of ADD equals the latency of FMA there.

Example2:
The same problem can still be reproduced easily on Skylake even though the
latencies of MUL/ADD/FMA are all equal there:

// LOOP2
for (unsigned i = 0; i < N; i++) {
  accu = a[i] * b + c[i] * d + accu;
}

There are at least 3 different sequences for LOOP2:

S1, 2 FMAs: ACCU = FMA(a[i], b, FMA(c[i], d, ACCU))           // LATENCY = 2*Lat(FMA) = 2*4
S2, 0 FMAs: ACCU = ADD(ADD(MUL(a[i], b), MUL(c[i], d)), ACCU) // LATENCY = 2*Lat(ADD) = 2*4
S3, 1 FMA:  ACCU = ADD(ACCU, FMA(a[i], b, MUL(c[i], d)))      // LATENCY = 1*Lat(ADD) = 4

In (S3), the MUL and FMA operations add nothing to the latency of the whole
expression because the out-of-order CPU has enough execution units to produce
the results of the MUL and FMA before the ADD is ready to consume them.
So (S3) would be about 2 times faster on Skylake and up to 3.3 times faster on Haswell.

Example3:
It shows that heuristics implemented for Example1 and Example2 may be wrong
if applied without whole-loop analysis.

// LOOP3
for (unsigned i = 0; i < N; i++) {
  accu1 = a1[i] * b + c1[i] * d + accu1;
  accu2 = a2[i] * b + c2[i] * d + accu2;
  accu3 = a3[i] * b + c3[i] * d + accu3;
  accu4 = a4[i] * b + c4[i] * d + accu4;
  accu5 = a5[i] * b + c5[i] * d + accu5;
  accu6 = a6[i] * b + c6[i] * d + accu6;
  accu7 = a7[i] * b + c7[i] * d + accu7;
  accu8 = a8[i] * b + c8[i] * d + accu8;
}

This loop must be tuned for throughput because its many independent DAGs put
high pressure on the CPU execution units.
The sequence (S1) from Example2 is the best choice for all accumulators in LOOP3:
"ACCUi = FMA(ai[i], b, FMA(ci[i], d, ACCUi))".
It is faster because the loop is bound by throughput.

On SkyLake:

T = approximate throughput of the loop in clock ticks
  = N * 16 operations / 2 execution units = N*8
L = latency of the loop
  = N * 2*Lat(FMA) = N*2*4 = N*8

The time spent in such a loop is MAX(L,T) = MAX(N*8, N*8).

Attempts to replace the FMAs with MUL and ADD may reduce (L) but will increase (T),
so the time spent in the loop, MAX(L,T), will only grow.

Example4:
There may be mixed tuning, i.e. for both throughput and latency in one loop:

// LOOP4
for (unsigned i = 0; i < N; i++) {
  accu1 = a1[i] * b + c1[i] * d + accu1; // tune for latency
  accu2 = a2[i] * b + accu2; // tune for throughput
  accu3 = a3[i] * b + accu3; // tune for throughput
  accu4 = a4[i] * b + accu4; // tune for throughput
  accu5 = a5[i] * b + accu5; // tune for throughput
  accu6 = a6[i] * b + accu6; // tune for throughput
}

On Haswell:
If we generate 2 FMAs for ACCU1 and 1 FMA for each of ACCU2..6, then:

L = latency of the loop = N*2*Latency(FMA) = N*2*5 = N*10
T = throughput = N * 7 operations / 2 execution units = N*3.5
MAX(L,T) = N*10

Using 1xMUL+1xFMA+1xADD for ACCU1 reduces the latency L from N*2*5 to
L = N*Latency(FMA) = N*5,
and only slightly increases T from N*3.5 to
T = N * 8 operations / 2 execution units = N*4.

As a result, using sequence (S3) reduces MAX(L,T) from N*10 to MAX(N*5, N*4) = N*5.

Splitting the FMAs in ACCU2..6 as well would only increase MAX(L,T):

L = N*Latency(ADD) = N*3
T = N * 13 operations / 2 = N*6.5
MAX(L,T) = MAX(N*3, N*6.5) = N*6.5

So the best solution in Example4 is to split the 1 FMA for ACCU1 but keep all the other FMAs.

Everyone's testing should not gate M1.

No, but this patch has been reverted before, and I'd rather wait for Eric's comment before pushing this again.

Making it a hidden disabled option would give you the ability to enable downstream and us the ability to test on other cores / situations, and wouldn't break Eric's tests.

cheers,
--renato

evandro added inline comments.Oct 7 2016, 1:14 PM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
645

I guess that this should be qualified with STI.hasNEON()...

4622

... and this check removed.

4629

I guess that target information should no longer be present here. But do f16 types make sense when they are supported by the target but not by TargetRecip?

evandro added inline comments.Oct 7 2016, 3:24 PM
llvm/test/CodeGen/AArch64/sqrt-fastmath.ll
2

s/reverse/reciprocal/

evandro marked 4 inline comments as done.Oct 17 2016, 8:58 AM
evandro updated this revision to Diff 74871.Oct 17 2016, 12:29 PM
evandro edited edge metadata.

No, but this patch has been reverted before, and I'd rather wait for Eric's comment before pushing this again.

@echristo, @rengolin insists on your input.

Note that I recommitted D25440 at rL284746 which will affect this patch - should make it smaller. :)
I think that commit will stick this time; the bots that failed with the earlier version appear happy now.

Note that I recommitted D25440 at rL284746 which will affect this patch - should make it smaller. :)

Thank you for the heads up.

evandro updated this revision to Diff 75435.Oct 21 2016, 8:25 AM
evandro retitled this revision from [AArch64] Optionally use the reciprocal estimation machinery to [AArch64] Optionally use the Newton series for reciprocal estimation.
evandro edited edge metadata.
echristo accepted this revision.Oct 21 2016, 1:42 PM
echristo edited edge metadata.

LGTM.

Thanks!

-eric

This revision was automatically updated to reflect the committed changes.