This is an archive of the discontinued LLVM Phabricator instance.

Differential D19426

[AArch64] Use the reciprocal estimation machinery
ClosedPublic

Authored by evandro on Apr 22 2016, 11:52 AM.


Details

Reviewers

silviu.baranga
rengolin
eastig
t.p.northover
aemerson
jmolloy

Summary

This patch adds support for estimating the square root and its reciprocal, as well as division and the reciprocal, using the DAG combiner's generic reciprocal machinery.

I intend to follow this patch with support for using the respective step instructions. Later, I intend to add support for considering the context, specifically whether the code is bound by latency or throughput and whether the additional multiplications of the series would overwhelm execution units already under pressure.
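For readers unfamiliar with the technique, here is a minimal sketch of what "estimate plus series" means. It is not the code from this patch: `coarsen` is a hypothetical stand-in for a hardware estimate instruction (it merely truncates an exact result to about 8 bits), and the function names are made up for illustration.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-in for a hardware estimate instruction: keeps only
// about 8 significant bits of the exact value.
static double coarsen(double v) {
  int exp = 0;
  double m = std::frexp(v, &exp);     // v == m * 2^exp, 0.5 <= |m| < 1
  m = std::round(m * 256.0) / 256.0;  // drop all but ~8 mantissa bits
  return std::ldexp(m, exp);
}

// Newton-Raphson for 1/d: each step computes x' = x * (2 - d * x).
double approxRecip(double d, unsigned steps) {
  assert(d != 0.0);
  double x = coarsen(1.0 / d);  // initial estimate
  for (unsigned i = 0; i < steps; ++i)
    x = x * (2.0 - d * x);
  return x;
}

// Newton-Raphson for 1/sqrt(d): each step computes x' = x * (3 - d * x * x) / 2.
double approxRsqrt(double d, unsigned steps) {
  assert(d > 0.0);
  double x = coarsen(1.0 / std::sqrt(d));  // initial estimate
  for (unsigned i = 0; i < steps; ++i)
    x = x * (3.0 - d * x * x) * 0.5;
  return x;
}

// Division and square root are then derived from the refined reciprocals.
double approxDiv(double n, double d, unsigned steps) {
  return n * approxRecip(d, steps);
}

double approxSqrt(double d, unsigned steps) {
  return d * approxRsqrt(d, steps);
}
```

Each step costs a few multiplications, which is why the discussion below keeps returning to the pressure on the multiplication units.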

Event Timeline

evandro updated this revision to Diff 54692. Apr 22 2016, 11:52 AM

evandro retitled this revision from to [AArch64] Use the reciprocal estimation machiner.

evandro updated this object.

evandro added reviewers: t.p.northover, rengolin, aemerson, eastig, silviu.baranga.

evandro set the repository for this revision to rL LLVM.

Herald added subscribers: rengolin, aemerson. Apr 22 2016, 11:52 AM

flyingforyou added a subscriber: llvm-commits. Apr 22 2016, 11:27 PM

flyingforyou added a subscriber: flyingforyou.

evandro retitled this revision from [AArch64] Use the reciprocal estimation machiner to [AArch64] Use the reciprocal estimation machinery. Apr 25 2016, 7:45 AM

Ideally we would enable this everywhere (and not need to add additional features). Do you have any idea what the impact would be on other cores?

Cheers,
Silviu

In D19426#410755, @sbaranga wrote:

Ideally we would enable this everywhere (and not need to add additional features). Do you have any idea what the impact would be on other cores?

I have an idea for A57, where emitting the series would be beneficial only for sqrt(), but not for sqrtf() or any division.

Then again, I'm testing the waters by opting to use additional features instead of a plethora of isCPU() checks. Perhaps it's time to use features, even of some other kind, so that all such nuances remain in the machine descriptions instead of being peppered all over the rest of the source code. Perhaps not.

eastig added inline comments. Apr 26 2016, 5:28 AM

llvm/lib/Target/AArch64/AArch64.td
67 ↗	(On Diff #54692)	I think it's better to use names like FeatureReciprocalSqrtEstimate and FeatureReciprocalEstimate. They don't directly calculate DIV and SQRT; they calculate approximate 1/DIV and 1/SQRT.
142 ↗	(On Diff #54692)	Why is FeatureApproximateDiv not on the list of the features?
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
4663	Most of the code in these functions is the same; the only differences are the RecipOp string and the opcodes. I would suggest creating a template function or a function with the common code. Documenting the parameters would also be good. Am I correct that they will be used for the step versions in the future?
llvm/lib/Target/AArch64/AArch64TargetMachine.cpp
202	I suggest putting this code into a separate function, initReciprocals. This will help avoid a mess if more initialization is added in the future.

Thank you.

llvm/lib/Target/AArch64/AArch64.td
67 ↗	(On Diff #54692)	Do you mean to replace HasApproximateSqrt with FeatureReciprocalSqrtEstimate and so on?
142 ↗	(On Diff #54692)	Because Exynos M1 doesn't benefit from it. It might be beneficial on other cores, whose respective maintainers should be better equipped than I am to decide whether to add such features or not.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
4663	Will do. I'm currently working on having the step instructions emitted instead of discrete code.

Hi Evandro,

Wouldn't this require some form of -ffast-math flag? I assume precision will be affected somehow.

cheers,
--renato

In D19426#411935, @rengolin wrote:

Wouldn't this require some form of -ffast-math flag? I assume precision will be affected somehow.

Indeed it does, but DAGCombiner already makes sure of that.

eastig added inline comments. Apr 26 2016, 6:02 AM

llvm/lib/Target/AArch64/AArch64.td
67 ↗	(On Diff #54692)	Yes. When I first read these definitions, I formed an initial understanding based on their names. Later I saw the instructions and learned from their documentation what they do, which didn't match my initial understanding. I think someone who is not familiar with these instructions will have a similar issue when he or she starts looking at the code.
142 ↗	(On Diff #54692)	Sorry for a naive question. Do we need to have these features on the list? I thought they were part of ARMv8-A NEON.

Thank you.

llvm/lib/Target/AArch64/AArch64.td
67 ↗	(On Diff #54692)	On the other hand, the reciprocals are later used to calculate the square root and division. All features start with "Has", that's why I did the same. However, at the moment, the only features in use are ISA features, whereas this is an operation that, though it depends on an ISA feature, just makes use of it. So, in a way, if it started with "Does" it might be more appropriate. I'd appreciate finding out what others think. But, if it makes too many waves dissipating in every direction, I'd rather fall back to the silly trains of "isCPUNAME" tests.
142 ↗	(On Diff #54692)	They are, but whether it's beneficial to always use them or not, beyond explicit intrinsics, is another matter that would depend on specific sub-targets.

Address the previous comments that were more clearly understood.

Thank you.

Hi,

Then again, I'm testing the waters by opting to use additional features instead of a plethora of isCPU() checks. Perhaps it's time to use features, even of some other kind, so that all such nuances remain in the machine descriptions instead of being peppered all over the rest of the source code. Perhaps not.

I don't like the use of features here. They are a very large, indiscriminate hammer for when to enable this optimization. I don't know about Exynos M1, but on many chips the decision of whether to use reciprocals or not is contextual.

Often, the iterative SQRT instruction is faster in latency than a reciprocal alternative. Not only because there is less instruction fetch/dispatch/issue overhead but also because the iterative version can exit early in hardware if the NR steps converge quickly. A reciprocal alternative has to have a fixed number of steps which must be enough for the worst case. The reciprocal has the advantage that it is fully pipelined whereas the iterative SQRT might not be.

In my experiments, reciprocals are a poor choice in any situation where

(a) there are few data items to process, or
(b) the sqrt/div is on the critical path.

So this sequence would be pessimized by changing to reciprocals:

t = 0;
for (...) {
  t = t + a[i];
  t /= b[i];
}

Because the divide is on the critical path and is a loop dependence, the core can never overlap executions of the divide, so changing to reciprocals and extending the critical path would be a loss.

However here:

for (...) {
  a[i] = a[i] / b[i];
}

We can vectorize and unroll this. For this, reciprocals could be a *significant* win.
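The two loops above can be written out as compilable functions to make the difference concrete. This is only a sketch: the function names and the use of `std::vector<float>` are assumptions for illustration, not part of the review.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Latency-bound: every divide needs the previous value of t, so successive
// divides cannot overlap. Lengthening the dependent chain with an estimate
// plus refinement steps makes each iteration slower.
float chainedDivide(const std::vector<float> &a, const std::vector<float> &b) {
  assert(a.size() == b.size());
  float t = 0.0f;
  for (std::size_t i = 0; i < a.size(); ++i) {
    t = t + a[i];
    t /= b[i];
  }
  return t;
}

// Throughput-bound: the iterations are independent, so the loop can be
// vectorized and unrolled, and fully pipelined multiplies can pay off.
void elementwiseDivide(std::vector<float> &a, const std::vector<float> &b) {
  assert(a.size() == b.size());
  for (std::size_t i = 0; i < a.size(); ++i)
    a[i] = a[i] / b[i];
}
```

The arithmetic is the same in both cases; only the dependence structure differs, which is exactly the information a per-subtarget switch cannot see.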

Your current implementation doesn't consider any of the situations where reciprocals might be beneficial or not, and it can't (because you've moved the heuristic out of TTI/TLI into Subtarget).

Cheers,

James

This revision now requires changes to proceed. Apr 27 2016, 5:04 AM

James,

It seems to me that your objection is not so much against this patch as against the machinery in the DAGCombiner.

I understand your points, and would even add that the series takes the pressure off the unit(s) that perform division and square root and puts it onto the unit(s) that perform multiplication.

I do intend to investigate this issue further, but I think that it's incremental, if not tangential, to what this patch proposes.

Finally, I'm interested in understanding better what you think would be more appropriate than using features.

Thank you.

Hi Evandro,

It seems to me that your objection is not so much against this patch as against the machinery in the DAGCombiner.

Perhaps. The important thing is how we decide whether to perform this optimisation or not. A hook function, as we normally use, has the ability to be extended to add more logic. A subtarget feature does not. I wouldn’t like to see optimization decisions being switched on or off by subtarget features.
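A rough sketch of the distinction being drawn here, with entirely hypothetical types (none of these names exist in LLVM): a feature bit gives one answer per subtarget, while a hook receives context and can grow more logic later.

```cpp
// All names below are hypothetical and only illustrate the argument.
struct DivContext {
  bool onCriticalPath;  // is the divide part of a loop-carried dependence?
  unsigned tripCount;   // how many independent items are processed
};

// Feature style: a single bit for the whole subtarget, blind to context.
struct FeatureBits {
  bool useReciprocalDiv;
};

bool shouldUseRecipFeature(const FeatureBits &features, const DivContext &) {
  return features.useReciprocalDiv;
}

// Hook style: the default is conservative, and a target can override it
// with whatever heuristic it needs.
struct TargetHooks {
  virtual ~TargetHooks() = default;
  virtual bool shouldUseRecip(const DivContext &) const { return false; }
};

struct ExampleCoreHooks : TargetHooks {
  bool shouldUseRecip(const DivContext &ctx) const override {
    // The threshold of 4 is arbitrary, chosen only for the example.
    return !ctx.onCriticalPath && ctx.tripCount >= 4;
  }
};
```

With the feature bit, a divide on the critical path and a divide in a vectorizable loop get the same answer; with the hook they need not.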

Cheers,

James

mssimpso added a subscriber: mssimpso. Apr 27 2016, 10:39 AM

In D19426#413925, @jmolloy wrote:

The important thing is how we decide whether to perform this optimisation or not. A hook function, as we normally use, has the ability to be extended to add more logic. A subtarget feature does not. I wouldn’t like to see optimization decisions being switched on or off by subtarget features.

James,

Whether to use the series in a loop bound by latency, as in the 1st example, in a loop bound by throughput, as in the 2nd example, or in a sequence with many multiplications would still depend on the sub-target. For instance, in one sub-target division or square root could take much longer than the series, or there could be plenty of units capable of multiplication. Hopefully the context of this hook will allow querying the sub-target about such details. I'll start working on this soon.

Still, methinks that this patch is good for some sub-targets as is and it shouldn't be the enemy of "perfection". ;-)

Thank you.


I removed the features in this version of the patch and added test cases.

Ahem... and added test cases.

Hi Evandro,

This looks much nicer, thanks! Just a couple of readability issues left from my side.

Cheers,

James

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
4636	For readability, please could you put the test here in brackets and have a space before the '?' of the ternary? Same below: std::string RecipOp = (Opcode == AArch64ISD::FRECPE) ? "div" : "sqrt";
llvm/lib/Target/AArch64/AArch64TargetMachine.cpp
150	Please can you extract the heuristic computation from the call here, to make it explicit that's what it is? Something like: // FIXME: Only enable SQRT reciprocals for M1 for the moment. bool UseRSQRTE = ST.isExynosM1(); TM.Options.Reciprocals.setDefaults("sqrtf", UseRSQRTE, ExtraStepsF);

In the meantime, I've been considering how to issue the iterations using FRSQRTS and FRECPS. A possible way would be to have new SDNodes for them, used directly by the DAGCombiner, which are by default lowered as a polynomial or, as for ARM, as the respective instructions. Thoughts?

Thank you.

evandro updated this revision to Diff 56153. May 4 2016, 7:59 AM

evandro removed rL LLVM as the repository for this revision.

LGTM!

In D19426#421150, @evandro wrote:

In the meantime, I've been considering how to issue the iterations using FRSQRTS and FRECPS. A possible way would be to have new SDNodes for them, used directly by the DAGCombiner, which are by default lowered as a polynomial or, as for ARM, as the respective instructions. Thoughts?

Thank you.

How does X86 do this?

This revision is now accepted and ready to land. May 4 2016, 8:08 AM

In D19426#421169, @jmolloy wrote:

How does X86 do this?

It doesn't. AFAIK, ARM is the only target that has an instruction for the series steps. All the other targets that can estimate the initial value, X86, PPC, IA64, MIPS, compute the series steps with polynomials.

Committed initial implementation as r268539.

rengolin closed this revision. Jun 27 2016, 6:53 AM

Revision Contents

Path	Size
llvm/lib/Target/AArch64/AArch64ISelLowering.h	9 lines
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp	36 lines
llvm/lib/Target/AArch64/AArch64InstrInfo.td	29 lines
llvm/lib/Target/AArch64/AArch64TargetMachine.h	2 lines
llvm/lib/Target/AArch64/AArch64TargetMachine.cpp	29 lines
llvm/test/CodeGen/AArch64/recp-fastmath.ll	79 lines
llvm/test/CodeGen/AArch64/sqrt-fastmath.ll	158 lines

Diff 56153

llvm/lib/Target/AArch64/AArch64ISelLowering.h

@@ ... @@ enum NodeType : unsigned {
   /// generated to compensate for the byte-swapping. But sometimes we do
   /// need to re-interpret the data in SIMD vector registers in big-endian
   /// mode without emitting such REV instructions.
   NVCAST,

   SMULL,
   UMULL,

+  // Reciprocal estimates.
+  FRECPE,
+  FRSQRTE,
+
   // NEON Load/Store with post-increment base updates
   LD2post = ISD::FIRST_TARGET_MEMORY_OPCODE,
   LD3post,
   LD4post,
   ST2post,
   ST3post,
   ST4post,
   LD1x2post,
@@ ... @@ private:
   SDValue LowerINT_TO_FP(SDValue Op, SelectionDAG &DAG) const;
   SDValue LowerVectorAND(SDValue Op, SelectionDAG &DAG) const;
   SDValue LowerVectorOR(SDValue Op, SelectionDAG &DAG) const;
   SDValue LowerCONCAT_VECTORS(SDValue Op, SelectionDAG &DAG) const;
   SDValue LowerFSINCOS(SDValue Op, SelectionDAG &DAG) const;

   SDValue BuildSDIVPow2(SDNode *N, const APInt &Divisor, SelectionDAG &DAG,
                         std::vector<SDNode *> *Created) const override;
+  SDValue getRsqrtEstimate(SDValue Operand, DAGCombinerInfo &DCI,
+                           unsigned &RefinementSteps,
+                           bool &UseOneConstNR) const override;
+  SDValue getRecipEstimate(SDValue Operand, DAGCombinerInfo &DCI,
+                           unsigned &RefinementSteps) const override;
   unsigned combineRepeatedFPDivisors() const override;

   ConstraintType getConstraintType(StringRef Constraint) const override;
   unsigned getRegisterByName(const char* RegName, EVT VT,
                              SelectionDAG &DAG) const override;

   /// Examine constraint string and operand type and determine a weight value.
   /// The operand object must already have been set up with the operand type.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp


@@ ... @@ const char *AArch64TargetLowering::getTargetNodeName(unsigned Opcode) const {
   case AArch64ISD::LD2LANEpost: return "AArch64ISD::LD2LANEpost";
   case AArch64ISD::LD3LANEpost: return "AArch64ISD::LD3LANEpost";
   case AArch64ISD::LD4LANEpost: return "AArch64ISD::LD4LANEpost";
   case AArch64ISD::ST2LANEpost: return "AArch64ISD::ST2LANEpost";
   case AArch64ISD::ST3LANEpost: return "AArch64ISD::ST3LANEpost";
   case AArch64ISD::ST4LANEpost: return "AArch64ISD::ST4LANEpost";
   case AArch64ISD::SMULL: return "AArch64ISD::SMULL";
   case AArch64ISD::UMULL: return "AArch64ISD::UMULL";
+  case AArch64ISD::FRSQRTE: return "AArch64ISD::FRSQRTE";
+  case AArch64ISD::FRECPE: return "AArch64ISD::FRECPE";
   }
   return nullptr;
 }

 MachineBasicBlock *
 AArch64TargetLowering::EmitF128CSEL(MachineInstr *MI,
                                     MachineBasicBlock *MBB) const {
   // We materialise the F128CSEL pseudo-instruction as some control flow and a
@@ ... @@ else if (VT == MVT::f32)
     return AArch64_AM::getFP32Imm(Imm) != -1;
   return false;
 }

 //===----------------------------------------------------------------------===//
 // AArch64 Optimization Hooks
 //===----------------------------------------------------------------------===//

+/// getEstimate - Return the appropriate estimate DAG for either the reciprocal
+/// or the reciprocal square root.
+static SDValue getEstimate(const AArch64Subtarget &ST,
+    const AArch64TargetLowering::DAGCombinerInfo &DCI, unsigned Opcode,
+    const SDValue &Operand, unsigned &ExtraSteps) {
+  if (!ST.hasNEON())
+    return SDValue();
+
+  EVT VT = Operand.getValueType();
+
+  std::string RecipOp;
+  RecipOp = Opcode == (AArch64ISD::FRECPE) ? "div": "sqrt";
+  RecipOp = ((VT.isVector()) ? "vec-": "") + RecipOp;
+  RecipOp += (VT.getScalarType() == MVT::f64) ? "d": "f";
+
+  TargetRecip Recips = DCI.DAG.getTarget().Options.Reciprocals;
+  if (!Recips.isEnabled(RecipOp))
+    return SDValue();
+
+  ExtraSteps = Recips.getRefinementSteps(RecipOp);
+  return DCI.DAG.getNode(Opcode, SDLoc(Operand), VT, Operand);
+}
+
+SDValue AArch64TargetLowering::getRecipEstimate(SDValue Operand,
+    DAGCombinerInfo &DCI, unsigned &ExtraSteps) const {
+  return getEstimate(*Subtarget, DCI, AArch64ISD::FRECPE, Operand, ExtraSteps);
+}
+
+SDValue AArch64TargetLowering::getRsqrtEstimate(SDValue Operand,
+    DAGCombinerInfo &DCI, unsigned &ExtraSteps, bool &UseOneConst) const {
+  UseOneConst = true;
+  return getEstimate(*Subtarget, DCI, AArch64ISD::FRSQRTE, Operand, ExtraSteps);
+}
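The key that getEstimate builds to look up the reciprocal settings follows a simple naming scheme: an optional "vec-" prefix, then "div" or "sqrt", then "f" or "d". A standalone sketch of that scheme, with a hypothetical `EstimateKind` enum standing in for the opcode and plain booleans standing in for the value-type queries:

```cpp
#include <string>

// Hypothetical stand-in for the FRECPE/FRSQRTE opcode distinction.
enum class EstimateKind { Recip, Rsqrt };

// Builds keys such as "divf", "sqrtd", "vec-divf" or "vec-sqrtd".
std::string recipOpKey(EstimateKind kind, bool isVector, bool isDouble) {
  std::string key = (kind == EstimateKind::Recip) ? "div" : "sqrt";
  if (isVector)
    key = "vec-" + key;
  key += isDouble ? "d" : "f";
  return key;
}
```

These are the same eight keys for which initReciprocals in AArch64TargetMachine.cpp sets defaults.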

 //===----------------------------------------------------------------------===//
 // AArch64 Inline Assembly Support
 //===----------------------------------------------------------------------===//

 // Table of Constraints
 // TODO: This is the current set of constraints supported by ARM for the
 // compiler, not all of them may make sense, e.g. S may be difficult to support.
 //
 // r - A general register
 // w - An FP/SIMD register of some size in the range v0-v31
 // x - An FP/SIMD register of some size in the range v0-v15
 // I - Constant that can be used with an ADD instruction
 // J - Constant that can be used with a SUB instruction

llvm/lib/Target/AArch64/AArch64InstrInfo.td


@@ ... @@

 def AArch64NvCast : SDNode<"AArch64ISD::NVCAST", SDTUnaryOp>;

 def SDT_AArch64mull : SDTypeProfile<1, 2, [SDTCisInt<0>, SDTCisInt<1>,
                                            SDTCisSameAs<1, 2>]>;
 def AArch64smull : SDNode<"AArch64ISD::SMULL", SDT_AArch64mull>;
 def AArch64umull : SDNode<"AArch64ISD::UMULL", SDT_AArch64mull>;

+def AArch64frecpe : SDNode<"AArch64ISD::FRECPE", SDTFPUnaryOp>;
+def AArch64frsqrte : SDNode<"AArch64ISD::FRSQRTE", SDTFPUnaryOp>;
+
 def AArch64saddv : SDNode<"AArch64ISD::SADDV", SDT_AArch64UnaryVec>;
 def AArch64uaddv : SDNode<"AArch64ISD::UADDV", SDT_AArch64UnaryVec>;
 def AArch64sminv : SDNode<"AArch64ISD::SMINV", SDT_AArch64UnaryVec>;
 def AArch64uminv : SDNode<"AArch64ISD::UMINV", SDT_AArch64UnaryVec>;
 def AArch64smaxv : SDNode<"AArch64ISD::SMAXV", SDT_AArch64UnaryVec>;
 def AArch64umaxv : SDNode<"AArch64ISD::UMAXV", SDT_AArch64UnaryVec>;

 //===----------------------------------------------------------------------===//
@@ ... @@

 def : Pat<(f32 (int_aarch64_neon_frecpe (f32 FPR32:$Rn))),
           (FRECPEv1i32 FPR32:$Rn)>;
 def : Pat<(f64 (int_aarch64_neon_frecpe (f64 FPR64:$Rn))),
           (FRECPEv1i64 FPR64:$Rn)>;
 def : Pat<(v1f64 (int_aarch64_neon_frecpe (v1f64 FPR64:$Rn))),
           (FRECPEv1i64 FPR64:$Rn)>;

+def : Pat<(f32 (AArch64frecpe (f32 FPR32:$Rn))),
+          (FRECPEv1i32 FPR32:$Rn)>;
+def : Pat<(v2f32 (AArch64frecpe (v2f32 V64:$Rn))),
+          (FRECPEv2f32 V64:$Rn)>;
+def : Pat<(v4f32 (AArch64frecpe (v4f32 FPR128:$Rn))),
+          (FRECPEv4f32 FPR128:$Rn)>;
+def : Pat<(f64 (AArch64frecpe (f64 FPR64:$Rn))),
+          (FRECPEv1i64 FPR64:$Rn)>;
+def : Pat<(v1f64 (AArch64frecpe (v1f64 FPR64:$Rn))),
+          (FRECPEv1i64 FPR64:$Rn)>;
+def : Pat<(v2f64 (AArch64frecpe (v2f64 FPR128:$Rn))),
+          (FRECPEv2f64 FPR128:$Rn)>;
+
 def : Pat<(f32 (int_aarch64_neon_frecpx (f32 FPR32:$Rn))),
           (FRECPXv1i32 FPR32:$Rn)>;
 def : Pat<(f64 (int_aarch64_neon_frecpx (f64 FPR64:$Rn))),
           (FRECPXv1i64 FPR64:$Rn)>;

 def : Pat<(f32 (int_aarch64_neon_frsqrte (f32 FPR32:$Rn))),
           (FRSQRTEv1i32 FPR32:$Rn)>;
 def : Pat<(f64 (int_aarch64_neon_frsqrte (f64 FPR64:$Rn))),
           (FRSQRTEv1i64 FPR64:$Rn)>;
 def : Pat<(v1f64 (int_aarch64_neon_frsqrte (v1f64 FPR64:$Rn))),
           (FRSQRTEv1i64 FPR64:$Rn)>;

+def : Pat<(f32 (AArch64frsqrte (f32 FPR32:$Rn))),
+          (FRSQRTEv1i32 FPR32:$Rn)>;
+def : Pat<(v2f32 (AArch64frsqrte (v2f32 V64:$Rn))),
+          (FRSQRTEv2f32 V64:$Rn)>;
+def : Pat<(v4f32 (AArch64frsqrte (v4f32 FPR128:$Rn))),
+          (FRSQRTEv4f32 FPR128:$Rn)>;
+def : Pat<(f64 (AArch64frsqrte (f64 FPR64:$Rn))),
+          (FRSQRTEv1i64 FPR64:$Rn)>;
+def : Pat<(v1f64 (AArch64frsqrte (v1f64 FPR64:$Rn))),
+          (FRSQRTEv1i64 FPR64:$Rn)>;
+def : Pat<(v2f64 (AArch64frsqrte (v2f64 FPR128:$Rn))),
+          (FRSQRTEv2f64 FPR128:$Rn)>;
+
 // If an integer is about to be converted to a floating point value,
 // just load it on the floating point unit.
 // Here are the patterns for 8 and 16-bits to float.
 // 8-bits -> float.
 multiclass UIntToFPROLoadPat<ValueType DstTy, ValueType SrcTy,
                              SDPatternOperator loadop, Instruction UCVTF,
                              ROAddrMode ro, Instruction LDRW, Instruction LDRX,
                              SubRegIndex sub> {

llvm/lib/Target/AArch64/AArch64TargetMachine.h

@@ ... @@ public:
   /// \brief Get the TargetIRAnalysis for this target.
   TargetIRAnalysis getTargetIRAnalysis() override;

   TargetLoweringObjectFile* getObjFileLowering() const override {
     return TLOF.get();
   }

 private:
-  bool isLittle;
+  AArch64Subtarget Subtarget;
 };

 // AArch64leTargetMachine - AArch64 little endian target machine.
 //
 class AArch64leTargetMachine : public AArch64TargetMachine {
   virtual void anchor();
 public:
   AArch64leTargetMachine(const Target &T, const Triple &TT, StringRef CPU,

llvm/lib/Target/AArch64/AArch64TargetMachine.cpp

@@ ... @@
 static std::string computeDataLayout(const Triple &TT, bool LittleEndian) {
   if (TT.isOSBinFormatMachO())
     return "e-m:o-i64:64-i128:128-n32:64-S128";
   if (LittleEndian)
     return "e-m:e-i64:64-i128:128-n32:64-S128";
   return "E-m:e-i64:64-i128:128-n32:64-S128";
 }

+// Helper function to set up the defaults for reciprocals.
+static void initReciprocals(AArch64TargetMachine& TM, AArch64Subtarget& ST)
+{
+  // For the estimates, convergence is quadratic, so essentially the number of
+  // digits is doubled after each iteration. ARMv8, the minimum architected
+  // accuracy of the initial estimate is 2^-8. Therefore, the number of extra
+  // steps to refine the result for float (23 mantissa bits) and for double
+  // (52 mantissa bits) are 2 and 3, respectively.
+  unsigned ExtraStepsF = 2,
+           ExtraStepsD = ExtraStepsF + 1;
+  // FIXME: Enable x^-1/2 only for Exynos M1 at the moment.
+  bool UseRsqrt = ST.isExynosM1();
+
+  TM.Options.Reciprocals.setDefaults("sqrtf", UseRsqrt, ExtraStepsF);
+  TM.Options.Reciprocals.setDefaults("sqrtd", UseRsqrt, ExtraStepsD);
+  TM.Options.Reciprocals.setDefaults("vec-sqrtf", UseRsqrt, ExtraStepsF);
+  TM.Options.Reciprocals.setDefaults("vec-sqrtd", UseRsqrt, ExtraStepsD);
+
+  TM.Options.Reciprocals.setDefaults("divf", false, ExtraStepsF);
+  TM.Options.Reciprocals.setDefaults("divd", false, ExtraStepsD);
+  TM.Options.Reciprocals.setDefaults("vec-divf", false, ExtraStepsF);
+  TM.Options.Reciprocals.setDefaults("vec-divd", false, ExtraStepsD);
+}
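The step counts in the comment of initReciprocals follow from the doubling argument: starting from about 8 accurate bits, each step doubles the count until it covers the mantissa. A small sketch of that arithmetic, where `extraSteps` is a hypothetical helper and not part of the patch:

```cpp
#include <cassert>

// Counts the refinement steps needed when every step doubles the number of
// accurate bits, starting from estimateBits and aiming for mantissaBits.
unsigned extraSteps(unsigned estimateBits, unsigned mantissaBits) {
  assert(estimateBits > 0 && "doubling zero bits never converges");
  unsigned steps = 0;
  for (unsigned bits = estimateBits; bits < mantissaBits; bits *= 2)
    ++steps;
  return steps;
}
```

For float, 8 bits become 16 and then 32, which covers 23 mantissa bits in 2 steps; for double, 8 becomes 16, 32 and then 64, which covers 52 mantissa bits in 3 steps.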

 /// TargetMachine ctor - Create an AArch64 architecture model.
 ///
 AArch64TargetMachine::AArch64TargetMachine(const Target &T, const Triple &TT,
                                            StringRef CPU, StringRef FS,
                                            const TargetOptions &Options,
                                            Reloc::Model RM, CodeModel::Model CM,
                                            CodeGenOpt::Level OL,
                                            bool LittleEndian)
     // This nested ternary is horrible, but DL needs to be properly
     // initialized before TLInfo is constructed.
     : LLVMTargetMachine(T, computeDataLayout(TT, LittleEndian), TT, CPU, FS,
                         Options, RM, CM, OL),
       TLOF(createTLOF(getTargetTriple())),
-      isLittle(LittleEndian) {
+      Subtarget(TT, CPU, FS, *this, LittleEndian) {
+  initReciprocals(*this, Subtarget);
   initAsmInfo();
 }

 AArch64TargetMachine::~AArch64TargetMachine() {}

 #ifdef LLVM_BUILD_GLOBAL_ISEL
 namespace {
 struct AArch64GISelActualAccessor : public GISelAccessor {
   std::unique_ptr<CallLowering> CallLoweringInfo;
   std::unique_ptr<RegisterBankInfo> RegBankInfo;
   const CallLowering *getCallLowering() const override {
     return CallLoweringInfo.get();
   }
   const RegisterBankInfo *getRegBankInfo() const override {
     return RegBankInfo.get();
   }
 };
 } // End anonymous namespace.
 #endif

 const AArch64Subtarget *
 AArch64TargetMachine::getSubtargetImpl(const Function &F) const {
   Attribute CPUAttr = F.getFnAttribute("target-cpu");
   Attribute FSAttr = F.getFnAttribute("target-features");

   std::string CPU = !CPUAttr.hasAttribute(Attribute::None)
                         ? CPUAttr.getValueAsString().str()
                         : TargetCPU;
   std::string FS = !FSAttr.hasAttribute(Attribute::None)
                        ? FSAttr.getValueAsString().str()
                        : TargetFS;

   auto &I = SubtargetMap[CPU + FS];
   if (!I) {
     // This needs to be done before we create a new subtarget since any
     // creation will depend on the TM and the code generation flags on the
     // function that reside in TargetOptions.
     resetTargetOptions(F);
     I = llvm::make_unique<AArch64Subtarget>(TargetTriple, CPU, FS, *this,
-                                            isLittle);
+                                            Subtarget.isLittleEndian());
 #ifndef LLVM_BUILD_GLOBAL_ISEL
	GISelAccessor *GISel = new GISelAccessor();			GISelAccessor *GISel = new GISelAccessor();
	#else			#else
	AArch64GISelActualAccessor *GISel =			AArch64GISelActualAccessor *GISel =
	new AArch64GISelActualAccessor();			new AArch64GISelActualAccessor();
	GISel->CallLoweringInfo.reset(			GISel->CallLoweringInfo.reset(
	new AArch64CallLowering(*I->getTargetLowering()));			new AArch64CallLowering(*I->getTargetLowering()));
	GISel->RegBankInfo.reset(			GISel->RegBankInfo.reset(
	▲ Show 20 Lines • Show All 202 Lines • Show Last 20 Lines

llvm/test/CodeGen/AArch64/recp-fastmath.ll

This file was added.

				; RUN: llc < %s -mtriple=aarch64 -mattr=neon -recip=!divf,!vec-divf | FileCheck %s --check-prefix=FAULT
				; RUN: llc < %s -mtriple=aarch64 -mattr=neon -recip=divf,vec-divf | FileCheck %s

				define float @frecp(float %x) #0 {
				%div = fdiv fast float 1.0, %x
				ret float %div

				; FAULT-LABEL: frecp:
				; FAULT-NEXT: BB#0
				; FAULT-NEXT: fmov
				; FAULT-NEXT: fdiv

				; CHECK-LABEL: frecp:
				; CHECK-NEXT: BB#0
				; CHECK-NEXT: frecpe
				; CHECK-NEXT: fmov
				}

				define <2 x float> @f2recp(<2 x float> %x) #0 {
				%div = fdiv fast <2 x float> <float 1.0, float 1.0>, %x
				ret <2 x float> %div

				; FAULT-LABEL: f2recp:
				; FAULT-NEXT: BB#0
				; FAULT-NEXT: fmov
				; FAULT-NEXT: fdiv

				; CHECK-LABEL: f2recp:
				; CHECK-NEXT: BB#0
				; CHECK-NEXT: fmov
				; CHECK-NEXT: frecpe
				}

				define <4 x float> @f4recp(<4 x float> %x) #0 {
				%div = fdiv fast <4 x float> <float 1.0, float 1.0, float 1.0, float 1.0>, %x
				ret <4 x float> %div

				; FAULT-LABEL: f4recp:
				; FAULT-NEXT: BB#0
				; FAULT-NEXT: fmov
				; FAULT-NEXT: fdiv

				; CHECK-LABEL: f4recp:
				; CHECK-NEXT: BB#0
				; CHECK-NEXT: fmov
				; CHECK-NEXT: frecpe
				}

				define double @drecp(double %x) #0 {
				%div = fdiv fast double 1.0, %x
				ret double %div

				; FAULT-LABEL: drecp:
				; FAULT-NEXT: BB#0
				; FAULT-NEXT: fmov
				; FAULT-NEXT: fdiv

				; CHECK-LABEL: drecp:
				; CHECK-NEXT: BB#0
				; CHECK-NEXT: frecpe
				; CHECK-NEXT: fmov
				}

				define <2 x double> @d2recp(<2 x double> %x) #0 {
				%div = fdiv fast <2 x double> <double 1.0, double 1.0>, %x
				ret <2 x double> %div

				; FAULT-LABEL: d2recp:
				; FAULT-NEXT: BB#0
				; FAULT-NEXT: fmov
				; FAULT-NEXT: fdiv

				; CHECK-LABEL: d2recp:
				; CHECK-NEXT: BB#0
				; CHECK-NEXT: fmov
				; CHECK-NEXT: frecpe
				}

				attributes #0 = { nounwind "unsafe-fp-math"="true" }
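
The CHECK lines above only match the initial frecpe estimate. As background, a reciprocal estimate is normally refined with Newton-Raphson steps of the form x' = x * (2 - d * x), which are the "additional multiplications of the series" mentioned in the summary. The following is a minimal standalone sketch of that iteration, not the code in this patch: roughRecipEstimate is a hypothetical stand-in for the hardware estimate, and its 2^-8 relative error is an assumption chosen for illustration.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-in for a hardware reciprocal estimate such as FRECPE:
// returns 1/d with a deliberate relative error of about 2^-8 (an assumption).
inline float roughRecipEstimate(float d) {
  return (1.0f / d) * (1.0f + 1.0f / 256.0f);
}

// Newton-Raphson refinement of x ~= 1/d: x' = x * (2 - d * x).
// Each step roughly squares the relative error of the estimate.
inline float refineRecip(float d, float x, unsigned steps) {
  assert(d != 0.0f && "sketch does not handle a zero divisor");
  for (unsigned i = 0; i < steps; ++i)
    x = x * (2.0f - d * x);
  return x;
}

// Relative error of an approximation against a reference value.
inline float relativeError(float approx, float exact) {
  return std::fabs(approx - exact) / std::fabs(exact);
}
```

Under these assumptions, two refinement steps take a single-precision estimate from roughly 8 good bits to close to full precision, which is why the number of extra steps is a tuning knob for such lowering.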

llvm/test/CodeGen/AArch64/sqrt-fastmath.ll

This file was added.

				; RUN: llc < %s -mtriple=aarch64 -mattr=neon -recip=!sqrt,!vec-sqrt | FileCheck %s --check-prefix=FAULT
				; RUN: llc < %s -mtriple=aarch64 -mattr=neon -recip=sqrt,vec-sqrt | FileCheck %s

				declare float @llvm.sqrt.f32(float) #1
				declare double @llvm.sqrt.f64(double) #1
				declare <2 x float> @llvm.sqrt.v2f32(<2 x float>) #1
				declare <4 x float> @llvm.sqrt.v4f32(<4 x float>) #1
				declare <2 x double> @llvm.sqrt.v2f64(<2 x double>) #1

				define float @fsqrt(float %a) #0 {
				%1 = tail call fast float @llvm.sqrt.f32(float %a)
				ret float %1

				; FAULT-LABEL: fsqrt:
				; FAULT-NEXT: BB#0
				; FAULT-NEXT: fsqrt

				; CHECK-LABEL: fsqrt:
				; CHECK-NEXT: BB#0
				; CHECK-NEXT: fmov
				; CHECK-NEXT: frsqrte
				}

				define <2 x float> @f2sqrt(<2 x float> %a) #0 {
				%1 = tail call fast <2 x float> @llvm.sqrt.v2f32(<2 x float> %a) #2
				ret <2 x float> %1

				; FAULT-LABEL: f2sqrt:
				; FAULT-NEXT: BB#0
				; FAULT-NEXT: fsqrt

				; CHECK-LABEL: f2sqrt:
				; CHECK-NEXT: BB#0
				; CHECK-NEXT: fmov
				; CHECK-NEXT: mov
				; CHECK-NEXT: frsqrte
				}

				define <4 x float> @f4sqrt(<4 x float> %a) #0 {
				%1 = tail call fast <4 x float> @llvm.sqrt.v4f32(<4 x float> %a) #2
				ret <4 x float> %1

				; FAULT-LABEL: f4sqrt:
				; FAULT-NEXT: BB#0
				; FAULT-NEXT: fsqrt

				; CHECK-LABEL: f4sqrt:
				; CHECK-NEXT: BB#0
				; CHECK-NEXT: fmov
				; CHECK-NEXT: mov
				; CHECK-NEXT: frsqrte
				}

				define double @dsqrt(double %a) #0 {
				%1 = tail call fast double @llvm.sqrt.f64(double %a)
				ret double %1

				; FAULT-LABEL: dsqrt:
				; FAULT-NEXT: BB#0
				; FAULT-NEXT: fsqrt

				; CHECK-LABEL: dsqrt:
				; CHECK-NEXT: BB#0
				; CHECK-NEXT: fmov
				; CHECK-NEXT: frsqrte
				}

				define <2 x double> @d2sqrt(<2 x double> %a) #0 {
				%1 = tail call fast <2 x double> @llvm.sqrt.v2f64(<2 x double> %a) #2
				ret <2 x double> %1

				; FAULT-LABEL: d2sqrt:
				; FAULT-NEXT: BB#0
				; FAULT-NEXT: fsqrt

				; CHECK-LABEL: d2sqrt:
				; CHECK-NEXT: BB#0
				; CHECK-NEXT: fmov
				; CHECK-NEXT: mov
				; CHECK-NEXT: frsqrte
				}

				define float @frsqrt(float %a) #0 {
				%1 = tail call fast float @llvm.sqrt.f32(float %a)
				%2 = fdiv fast float 1.000000e+00, %1
				ret float %2

				; FAULT-LABEL: frsqrt:
				; FAULT-NEXT: BB#0
				; FAULT-NEXT: fsqrt

				; CHECK-LABEL: frsqrt:
				; CHECK-NEXT: BB#0
				; CHECK-NEXT: fmov
				; CHECK-NEXT: frsqrte
				}

				define <2 x float> @f2rsqrt(<2 x float> %a) #0 {
				%1 = tail call fast <2 x float> @llvm.sqrt.v2f32(<2 x float> %a) #2
				%2 = fdiv fast <2 x float> <float 1.000000e+00, float 1.000000e+00>, %1
				ret <2 x float> %2

				; FAULT-LABEL: f2rsqrt:
				; FAULT-NEXT: BB#0
				; FAULT-NEXT: fsqrt

				; CHECK-LABEL: f2rsqrt:
				; CHECK-NEXT: BB#0
				; CHECK-NEXT: fmov
				; CHECK-NEXT: frsqrte
				}

				define <4 x float> @f4rsqrt(<4 x float> %a) #0 {
				%1 = tail call fast <4 x float> @llvm.sqrt.v4f32(<4 x float> %a) #2
				%2 = fdiv fast <4 x float> <float 1.000000e+00, float 1.000000e+00, float 1.000000e+00, float 1.000000e+00>, %1
				ret <4 x float> %2

				; FAULT-LABEL: f4rsqrt:
				; FAULT-NEXT: BB#0
				; FAULT-NEXT: fsqrt

				; CHECK-LABEL: f4rsqrt:
				; CHECK-NEXT: BB#0
				; CHECK-NEXT: fmov
				; CHECK-NEXT: frsqrte
				}

				define double @drsqrt(double %a) #0 {
				%1 = tail call fast double @llvm.sqrt.f64(double %a)
				%2 = fdiv fast double 1.000000e+00, %1
				ret double %2

				; FAULT-LABEL: drsqrt:
				; FAULT-NEXT: BB#0
				; FAULT-NEXT: fsqrt

				; CHECK-LABEL: drsqrt:
				; CHECK-NEXT: BB#0
				; CHECK-NEXT: fmov
				; CHECK-NEXT: frsqrte
				}

				define <2 x double> @d2rsqrt(<2 x double> %a) #0 {
				%1 = tail call fast <2 x double> @llvm.sqrt.v2f64(<2 x double> %a) #2
				%2 = fdiv fast <2 x double> <double 1.000000e+00, double 1.000000e+00>, %1
				ret <2 x double> %2

				; FAULT-LABEL: d2rsqrt:
				; FAULT-NEXT: BB#0
				; FAULT-NEXT: fsqrt

				; CHECK-LABEL: d2rsqrt:
				; CHECK-NEXT: BB#0
				; CHECK-NEXT: fmov
				; CHECK-NEXT: frsqrte
				}

				attributes #0 = { nounwind "unsafe-fp-math"="true" }
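
Note that the CHECK lines for the plain sqrt tests match frsqrte as well, which suggests that sqrt(a) is formed from the reciprocal square root estimate, as a * (1/sqrt(a)). The sketch below illustrates that idea with the usual Newton-Raphson step for the reciprocal square root, x' = x * (1.5 - 0.5 * d * x * x). It is a standalone illustration rather than the code in this patch: roughRsqrtEstimate is a hypothetical stand-in for the hardware estimate, its 2^-8 error is an assumption, and only positive inputs are handled.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-in for a hardware estimate such as FRSQRTE:
// returns 1/sqrt(d) with a deliberate relative error of about 2^-8 (an assumption).
inline float roughRsqrtEstimate(float d) {
  return (1.0f / std::sqrt(d)) * (1.0f + 1.0f / 256.0f);
}

// Newton-Raphson refinement of x ~= 1/sqrt(d): x' = x * (1.5 - 0.5 * d * x * x).
inline float refineRsqrt(float d, float x, unsigned steps) {
  assert(d > 0.0f && "sketch only handles positive inputs");
  for (unsigned i = 0; i < steps; ++i)
    x = x * (1.5f - 0.5f * d * x * x);
  return x;
}

// sqrt(d) recovered from the refined reciprocal square root: d * (1/sqrt(d)).
inline float sqrtViaRsqrt(float d, unsigned steps) {
  return d * refineRsqrt(d, roughRsqrtEstimate(d), steps);
}
```

A real lowering also has to deal with inputs such as zero, where multiplying by the reciprocal square root would not give the expected result; the sketch sidesteps that by asserting on positive inputs.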

Revision Contents

Diff 56153
