This is an archive of the discontinued LLVM Phabricator instance.

Change sqrt partial inlining to depend on sqrt argument rather than result.
ClosedPublic

Authored by spatel on Jan 4 2017, 1:44 PM.

Details

Reviewers

sdardis
uweigand
eli.friedman
mkuper
hfinkel
RKSimon
craig.topper

Commits

rG0de1a4bc2d26: [PartiallyInlineLibCalls][x86] add TTI hook to allow sqrt inlining to depend on…
rL319094: [PartiallyInlineLibCalls][x86] add TTI hook to allow sqrt inlining to depend on…

Summary

This should do the right thing on X86 and resolve PR31455.
I'm not sure it makes sense for MIPS and SystemZ though, which were the original targets of the pass.

Diff Detail

Event Timeline

mkuper updated this revision to Diff 83120. Jan 4 2017, 1:44 PM

mkuper retitled this revision from to Change sqrt partial inlining to depend on sqrt argument rather than result..

mkuper updated this object.

mkuper added reviewers: RKSimon, uweigand, sdardis.

mkuper added a subscriber: llvm-commits.

RKSimon added reviewers: eli.friedman, hfinkel. Jan 5 2017, 3:37 AM

As mentioned on PR31455, on X86 btver2 this change goes from being slower than gcc (88cy vs 84cy), which hoists the sqrtsd, to actually being slightly faster (82cy). This is for a tight loop of ::sqrt() calls across an array of 65535 pre-randomized doubles (~10% of which use the sqrt call and the rest use sqrtsd). This will be mainly due to reduced speculative usage of the FSQRT unit.
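For readers following along, the two code shapes being compared can be sketched in plain C++ (this is not LLVM code, just a model). The names `native_sqrt`, `lib_sqrt`, `sqrt_check_result`, `sqrt_check_argument` and the two counters are hypothetical; `native_sqrt` stands in for the inlined sqrtsd and `lib_sqrt` for the errno-setting library call.

```cpp
#include <cmath>

// Hypothetical counters: how often each path runs.
static int native_count = 0;
static int libcall_count = 0;

// Stand-in for the inlined hardware instruction (e.g. sqrtsd).
static double native_sqrt(double x) {
  ++native_count;
  return std::sqrt(x);
}

// Stand-in for the library call that may set errno.
static double lib_sqrt(double x) {
  ++libcall_count;
  return std::sqrt(x);
}

// Result-based shape: always run the instruction, then fall back to the
// library call if the result is NaN.
static double sqrt_check_result(double x) {
  double v = native_sqrt(x);
  if (v == v) // ordered compare: true unless v is NaN
    return v;
  return lib_sqrt(x);
}

// Argument-based shape: test the input first, then run exactly one path.
static double sqrt_check_argument(double x) {
  if (x >= 0.0) // false for negative values and for NaN
    return native_sqrt(x);
  return lib_sqrt(x);
}
```

In this model a negative input costs one instruction plus one library call in the result-based shape, but only the library call in the argument-based shape, which is one way to picture the "reduced speculative usage" mentioned above.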

This change is increasing the branch density for MIPS in the supplied test case and register pressure, as LLVM now has to synthesise 0.0 into a floating point register. This in turn also decreases code density for MIPS as we can't load 0.0 in a single instruction like x86 in all cases.

lib/Transforms/Scalar/PartiallyInlineLibCalls.cpp
42–45	Shouldn't this be:
	// if (src > 0)
	//   v0 = sqrt_noreadmem(src) # native sqrt instruction
	// else
	//   v1 = sqrt(src) # library call
	// dst = phi(v0, v1)
test/CodeGen/Mips/optimize-fp-math.ll
7 ↗	(On Diff #83120)	This should be:
	; 32-LABEL: test_sqrtf_float_:
	; 32: mtc1 $zero, $f[[R0:[0-9]+]]
	; 32: c.ult.s $f12, $f[[R0]]
	; 32: bc1t $BB0_[[BB0:[0-9]+]]
	; 32: sqrt.s $f0, $f12
	; 32: $BB0_[[BB0]]:
	; 32: jal sqrtf
	Similarly for the 64 case.
21 ↗	(On Diff #83120)	Similar to my comment above, except only the first mtc1 has to be matched.

In D28314#636754, @sdardis wrote:

This change is increasing the branch density for MIPS in the supplied test case and register pressure, as LLVM now has to synthesise 0.0 into a floating point register. This in turn also decreases code density for MIPS as we can't load 0.0 in a single instruction like x86 in all cases.

I suspect that we need to use TTI here to pick the behavior here based on target preferences. What we want to know, I suspect, is: is a floating-point comparison against zero as cheap as, or cheaper than, a floating-point NaN test? -- We don't have an interface yet to make that query, but we could add one.

In D28314#636840, @hfinkel wrote:

I suspect that we need to use TTI here to pick the behavior here based on target preferences. What we want to know, I suspect, is: is a floating-point comparison against zero as cheap as, or cheaper than, a floating-point NaN test? -- We don't have an interface yet to make that query, but we could add one.

Would a TTI isFastMaterializeConstant be enough (similar to what we have in FastISel)?

In D28314#636863, @RKSimon wrote:

Would a TTI isFastMaterializeConstant be enough (similar to what we have in FastISel)?

I think this depends on how heuristic we want to make this decision process. There are tradeoffs here with OOO processing, register pressure, etc. We might just want a dedicated TTI interface. This issue with the FP materialization seems just one of many factors. As I understand it, the issue here is giving the x86 processor more time to compute the branch condition (which is difficult to do through the sqrt instruction).

In D28314#637005, @hfinkel wrote:

I think this depends on how heuristic we want to make this decision process. There are tradeoffs here with OOO processing, register pressure, etc. We might just want a dedicated TTI interface. This issue with the FP materialization seems just one of many factors. As I understand it, the issue here is giving the x86 processor more time to compute the branch condition (which is difficult to do through the sqrt instruction).

Yes avoiding unnecessary sqrtsd is my main interest, but similar issues have been found on other tickets: PR31510 (constant folding complex pow) also attempts the fast path and only then calls the lib func if any of the results are not finite - again testing the inputs may be more sensible.
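As a sanity check on "testing the inputs", here is a small C++ sketch (again not LLVM code) that compares the two guards on a handful of special values. The function names and the probe list are assumptions chosen for illustration; the sketch only checks these probes and is not a proof for all inputs.

```cpp
#include <cmath>
#include <limits>

// Result-based guard: true when the native sqrt result is not NaN.
static bool result_is_ordered(double src) {
  double v = std::sqrt(src);
  return v == v;
}

// Argument-based guard: true when src is ordered and >= 0,
// analogous to an ordered greater-or-equal FP comparison against 0.0.
static bool argument_is_nonnegative(double src) {
  return src >= 0.0;
}

// Returns true if both guards select the same path for every probe value.
static bool guards_agree() {
  using lim = std::numeric_limits<double>;
  const double probes[] = {0.0,
                           -0.0,
                           1.0,
                           -1.0,
                           lim::infinity(),
                           -lim::infinity(),
                           lim::quiet_NaN(),
                           lim::denorm_min(),
                           -lim::denorm_min()};
  for (double p : probes)
    if (result_is_ordered(p) != argument_is_nonnegative(p))
      return false;
  return true;
}
```

Note that -0.0 takes the fast path under both guards, since sqrt(-0.0) is -0.0 rather than NaN and -0.0 compares equal to 0.0.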

In D28314#637012, @RKSimon wrote:

Yes avoiding unnecessary sqrtsd is my main interest, but similar issues have been found on other tickets: PR31510 (constant folding complex pow) also attempts the fast path and only then calls the lib func if any of the results are not finite - again testing the inputs may be more sensible.

Can you clarify? The fundamental assumption here is that the non-finite-output case is rare. As such, we're not "avoiding" the sqrt instruction - we'll almost always need to execute it. Is the problem you're seeing that the non-finite-output case is not rare, that the rare non-finite-output case is nevertheless expensive enough to worry about, or that testing early is better because of branch-handling effects?

Simon, would you like to commandeer this patch?

I don't really have a strong opinion on the trade-offs here, was just curious about how we ended up with PR31455.

lib/Transforms/Scalar/PartiallyInlineLibCalls.cpp
42–45	The IR-level transformation is actually as described, see the good-prototype.ll test. The sqrt_noreadmem() call gets sunk into the branch later. Whether this should be the case or not is a different issue. :-)
test/CodeGen/Mips/optimize-fp-math.ll
7 ↗	(On Diff #83120)	Sure. The update script doesn't work for MIPS yet, right?

(Just to be clear, I meant @RKSimon, not @sdardis)

Taking over this patch from @mkuper - I'm going to get some tests done (well, for x86 at least) and come up with a proposal for TTI/TLI to control whether we should test the inputs or fast-path outputs of LibCalls.

Commandeering from the commandeer...er.

Herald added subscribers: eraman, arichardson, mcrosier. Nov 19 2017, 10:28 AM

Patch updated:
Add a specialized TTI hook as suggested. So now this becomes a functional change only for x86 which overrides the default hook.
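The shape of that hook can be modelled in a few lines of standalone C++. This is only a sketch: `TargetModel`, `X86Model` and `chooseGuard` are hypothetical names, and the real hook takes a `Type *` argument that is omitted here. The default of `true` and the x86 override of `false` follow the diff in this revision.

```cpp
#include <string>

// Hypothetical stand-in for the target hook. Returning true means
// "checking the sqrt result for NaN is the cheaper guard".
struct TargetModel {
  virtual ~TargetModel() = default;
  virtual bool isFCmpOrdCheaperThanFCmpZero() const { return true; }
};

// Model of the x86 override: prefer comparing the argument against 0.0.
struct X86Model : TargetModel {
  bool isFCmpOrdCheaperThanFCmpZero() const override { return false; }
};

// Describes which guard a pass would emit for the given target model.
static std::string chooseGuard(const TargetModel &T) {
  return T.isFCmpOrdCheaperThanFCmpZero()
             ? "fcmp ord (sqrt result, sqrt result)"
             : "fcmp oge (sqrt argument, 0.0)";
}
```

With this arrangement targets that do not override the hook keep a result-based check, and only x86 switches to the argument-based check.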

Ping.

Looking at this again, I'm seeing some possible cases where MIPS can do better but that's no reason to hold this patch back.

LGTM.

This revision is now accepted and ready to land. Nov 27 2017, 11:57 AM

Closed by commit rL319094: [PartiallyInlineLibCalls][x86] add TTI hook to allow sqrt inlining to depend on… (authored by spatel). Nov 27 2017, 1:16 PM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path	Size
include/llvm/Analysis/TargetTransformInfo.h	11 lines
include/llvm/Analysis/TargetTransformInfoImpl.h	2 lines
include/llvm/CodeGen/BasicTTIImpl.h	4 lines
lib/Analysis/TargetTransformInfo.cpp	4 lines
lib/Target/X86/X86TargetTransformInfo.h	1 line
lib/Target/X86/X86TargetTransformInfo.cpp	4 lines
lib/Transforms/Scalar/PartiallyInlineLibCalls.cpp	15 lines
test/CodeGen/X86/sqrt-partial.ll	18 lines
test/Transforms/PartiallyInlineLibCalls/X86/good-prototype.ll	24 lines

Diff 123504

include/llvm/Analysis/TargetTransformInfo.h

@@ ... @@ bool allowsMisalignedMemoryAccesses(LLVMContext &Context,
                                       bool *Fast = nullptr) const;

   /// \brief Return hardware support for population count.
   PopcntSupportKind getPopcntSupport(unsigned IntTyWidthInBit) const;

   /// \brief Return true if the hardware has a fast square-root instruction.
   bool haveFastSqrt(Type *Ty) const;

+  /// Return true if it is faster to check if a floating-point value is NaN
+  /// (or not-NaN) versus a comparison against a constant FP zero value.
+  /// Targets should override this if materializing a 0.0 for comparison is
+  /// generally as cheap as checking for ordered/unordered.
+  bool isFCmpOrdCheaperThanFCmpZero(Type *Ty) const;
+
   /// \brief Return the expected cost of supporting the floating point operation
   /// of the specified type.
   int getFPOpCost(Type *Ty) const;

   /// \brief Return the expected cost of materializing for the given integer
   /// immediate of the specified type.
   int getIntImmCost(const APInt &Imm, Type *Ty) const;

@@ ... @@ public:
   virtual bool isFPVectorizationPotentiallyUnsafe() = 0;
   virtual bool allowsMisalignedMemoryAccesses(LLVMContext &Context,
                                               unsigned BitWidth,
                                               unsigned AddressSpace,
                                               unsigned Alignment,
                                               bool *Fast) = 0;
   virtual PopcntSupportKind getPopcntSupport(unsigned IntTyWidthInBit) = 0;
   virtual bool haveFastSqrt(Type *Ty) = 0;
+  virtual bool isFCmpOrdCheaperThanFCmpZero(Type *Ty) = 0;
   virtual int getFPOpCost(Type *Ty) = 0;
   virtual int getIntImmCodeSizeCost(unsigned Opc, unsigned Idx, const APInt &Imm,
                                     Type *Ty) = 0;
   virtual int getIntImmCost(const APInt &Imm, Type *Ty) = 0;
   virtual int getIntImmCost(unsigned Opc, unsigned Idx, const APInt &Imm,
                             Type *Ty) = 0;
   virtual int getIntImmCost(Intrinsic::ID IID, unsigned Idx, const APInt &Imm,
                             Type *Ty) = 0;
@@ ... @@ bool allowsMisalignedMemoryAccesses(LLVMContext &Context,
     return Impl.allowsMisalignedMemoryAccesses(Context, BitWidth, AddressSpace,
                                                Alignment, Fast);
   }
   PopcntSupportKind getPopcntSupport(unsigned IntTyWidthInBit) override {
     return Impl.getPopcntSupport(IntTyWidthInBit);
   }
   bool haveFastSqrt(Type *Ty) override { return Impl.haveFastSqrt(Ty); }

+  bool isFCmpOrdCheaperThanFCmpZero(Type *Ty) override {
+    return Impl.isFCmpOrdCheaperThanFCmpZero(Ty);
+  }
+
   int getFPOpCost(Type *Ty) override { return Impl.getFPOpCost(Ty); }

   int getIntImmCodeSizeCost(unsigned Opc, unsigned Idx, const APInt &Imm,
                             Type *Ty) override {
     return Impl.getIntImmCodeSizeCost(Opc, Idx, Imm, Ty);
   }
   int getIntImmCost(const APInt &Imm, Type *Ty) override {
     return Impl.getIntImmCost(Imm, Ty);

include/llvm/Analysis/TargetTransformInfoImpl.h

@@ ... @@ bool allowsMisalignedMemoryAccesses(LLVMContext &Context,
                                       bool *Fast) { return false; }

   TTI::PopcntSupportKind getPopcntSupport(unsigned IntTyWidthInBit) {
     return TTI::PSK_Software;
   }

   bool haveFastSqrt(Type *Ty) { return false; }

+  bool isFCmpOrdCheaperThanFCmpZero(Type *Ty) { return true; }
+
   unsigned getFPOpCost(Type *Ty) { return TargetTransformInfo::TCC_Basic; }

   int getIntImmCodeSizeCost(unsigned Opcode, unsigned Idx, const APInt &Imm,
                             Type *Ty) {
     return 0;
   }

   unsigned getIntImmCost(const APInt &Imm, Type *Ty) { return TTI::TCC_Basic; }

include/llvm/CodeGen/BasicTTIImpl.h

@@ ... @@ public:

   bool haveFastSqrt(Type *Ty) {
     const TargetLoweringBase *TLI = getTLI();
     EVT VT = TLI->getValueType(DL, Ty);
     return TLI->isTypeLegal(VT) &&
            TLI->isOperationLegalOrCustom(ISD::FSQRT, VT);
   }

+  bool isFCmpOrdCheaperThanFCmpZero(Type *Ty) {
+    return true;
+  }
+
   unsigned getFPOpCost(Type *Ty) {
     // By default, FP instructions are no more expensive since they are
     // implemented in HW. Target specific TTI can override this.
     return TargetTransformInfo::TCC_Basic;
   }

   unsigned getOperationCost(unsigned Opcode, Type *Ty, Type *OpTy) {
     const TargetLoweringBase *TLI = getTLI();

lib/Analysis/TargetTransformInfo.cpp

@@ ... @@
 TargetTransformInfo::getPopcntSupport(unsigned IntTyWidthInBit) const {
   return TTIImpl->getPopcntSupport(IntTyWidthInBit);
 }

 bool TargetTransformInfo::haveFastSqrt(Type *Ty) const {
   return TTIImpl->haveFastSqrt(Ty);
 }

+bool TargetTransformInfo::isFCmpOrdCheaperThanFCmpZero(Type *Ty) const {
+  return TTIImpl->isFCmpOrdCheaperThanFCmpZero(Ty);
+}
+
 int TargetTransformInfo::getFPOpCost(Type *Ty) const {
   int Cost = TTIImpl->getFPOpCost(Ty);
   assert(Cost >= 0 && "TTI should not produce negative costs!");
   return Cost;
 }

 int TargetTransformInfo::getIntImmCodeSizeCost(unsigned Opcode, unsigned Idx,
                                                const APInt &Imm,

lib/Target/X86/X86TargetTransformInfo.h

@@ ... @@ int getIntImmCost(Intrinsic::ID IID, unsigned Idx, const APInt &Imm,
                     Type *Ty);
   bool isLSRCostLess(TargetTransformInfo::LSRCost &C1,
                      TargetTransformInfo::LSRCost &C2);
   bool isLegalMaskedLoad(Type *DataType);
   bool isLegalMaskedStore(Type *DataType);
   bool isLegalMaskedGather(Type *DataType);
   bool isLegalMaskedScatter(Type *DataType);
   bool hasDivRemOp(Type *DataType, bool IsSigned);
+  bool isFCmpOrdCheaperThanFCmpZero(Type *Ty);
   bool areInlineCompatible(const Function *Caller,
                            const Function *Callee) const;
   const TTI::MemCmpExpansionOptions *enableMemCmpExpansion(
       bool IsZeroCmp) const;
   bool enableInterleavedAccessVectorization();
 private:
   int getGSScalarCost(unsigned Opcode, Type *DataTy, bool VariableMask,
                       unsigned Alignment, unsigned AddressSpace);

lib/Target/X86/X86TargetTransformInfo.cpp

@@ ... @@ bool X86TTIImpl::isLegalMaskedScatter(Type *DataType) {
   return isLegalMaskedGather(DataType);
 }

 bool X86TTIImpl::hasDivRemOp(Type *DataType, bool IsSigned) {
   EVT VT = TLI->getValueType(DL, DataType);
   return TLI->isOperationLegal(IsSigned ? ISD::SDIVREM : ISD::UDIVREM, VT);
 }

+bool X86TTIImpl::isFCmpOrdCheaperThanFCmpZero(Type *Ty) {
+  return false;
+}
+
 bool X86TTIImpl::areInlineCompatible(const Function *Caller,
                                      const Function *Callee) const {
   const TargetMachine &TM = getTLI()->getTargetMachine();

   // Work this as a subsetting of subtarget features.
   const FeatureBitset &CallerBits =
       TM.getSubtargetImpl(*Caller)->getFeatureBits();
   const FeatureBitset &CalleeBits =

lib/Transforms/Scalar/PartiallyInlineLibCalls.cpp

@@ ... @@
 #include "llvm/Transforms/Utils/BasicBlockUtils.h"

 using namespace llvm;

 #define DEBUG_TYPE "partially-inline-libcalls"


 static bool optimizeSQRT(CallInst *Call, Function *CalledFunc,
-                         BasicBlock &CurrBB, Function::iterator &BB) {
+                         BasicBlock &CurrBB, Function::iterator &BB,
+                         const TargetTransformInfo *TTI) {
   // There is no need to change the IR, since backend will emit sqrt
   // instruction if the call has already been marked read-only.
   if (Call->onlyReadsMemory())
     return false;

   // Do the following transformation:
   //
   // (before)
   // dst = sqrt(src)
   //
   // (after)
   // v0 = sqrt_noreadmem(src) # native sqrt instruction.
-  // if (v0 is a NaN)
+  // [if (v0 is a NaN) || if (src < 0)]
   // v1 = sqrt(src) # library call.
   // dst = phi(v0, v1)
   //

   // Move all instructions following Call to newly created block JoinBB.
   // Create phi and replace all uses.
   BasicBlock *JoinBB = llvm::SplitBlock(&CurrBB, Call->getNextNode());
   IRBuilder<> Builder(JoinBB, JoinBB->begin());
-  PHINode *Phi = Builder.CreatePHI(Call->getType(), 2);
+  Type *Ty = Call->getType();
+  PHINode *Phi = Builder.CreatePHI(Ty, 2);
   Call->replaceAllUsesWith(Phi);

   // Create basic block LibCallBB and insert a call to library function sqrt.
   BasicBlock *LibCallBB = BasicBlock::Create(CurrBB.getContext(), "call.sqrt",
                                              CurrBB.getParent(), JoinBB);
   Builder.SetInsertPoint(LibCallBB);
   Instruction *LibCall = Call->clone();
   Builder.Insert(LibCall);
   Builder.CreateBr(JoinBB);

   // Add attribute "readnone" so that backend can use a native sqrt instruction
   // for this call. Insert a FP compare instruction and a conditional branch
   // at the end of CurrBB.
   Call->addAttribute(AttributeList::FunctionIndex, Attribute::ReadNone);
   CurrBB.getTerminator()->eraseFromParent();
   Builder.SetInsertPoint(&CurrBB);
-  Value *FCmp = Builder.CreateFCmpOEQ(Call, Call);
+  Value *FCmp = TTI->isFCmpOrdCheaperThanFCmpZero(Ty)
+                    ? Builder.CreateFCmpORD(Call, Call)
+                    : Builder.CreateFCmpOGE(Call->getOperand(0),
+                                            ConstantFP::get(Ty, 0.0));
   Builder.CreateCondBr(FCmp, JoinBB, LibCallBB);

   // Add phi operands.
   Phi->addIncoming(Call, &CurrBB);
   Phi->addIncoming(LibCall, LibCallBB);

   BB = JoinBB->getIterator();
   return true;
@@ ... @@ for (BasicBlock::iterator II = CurrBB->begin(), IE = CurrBB->end();
       if (CalledFunc->hasLocalLinkage() ||
           !TLI->getLibFunc(*CalledFunc, LF) || !TLI->has(LF))
         continue;

       switch (LF) {
       case LibFunc_sqrtf:
       case LibFunc_sqrt:
         if (TTI->haveFastSqrt(Call->getType()) &&
-            optimizeSQRT(Call, CalledFunc, *CurrBB, BB))
+            optimizeSQRT(Call, CalledFunc, *CurrBB, BB, TTI))
           break;
         continue;
       default:
         continue;
       }

       Changed = true;
       break;

test/CodeGen/X86/sqrt-partial.ll

 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
 ; RUN: llc < %s -mtriple=x86_64-unknown-unknown | FileCheck %s

 ; PR31455 - https://bugs.llvm.org/show_bug.cgi?id=31455
 ; We have to assume that errno can be set, so we have to make a libcall in that case.
 ; But it's better for perf to check that the argument is valid rather than the result of
 ; sqrtss/sqrtsd.
 ; Note: This is really a test of the -partially-inline-libcalls IR pass (and we have an IR test
 ; for that), but we're checking the final asm to make sure that comes out as expected too.

 define float @f(float %val) nounwind {
 ; CHECK-LABEL: f:
 ; CHECK: # BB#0:
-; CHECK-NEXT: sqrtss %xmm0, %xmm1
-; CHECK-NEXT: ucomiss %xmm1, %xmm1
-; CHECK-NEXT: jp .LBB0_2
+; CHECK-NEXT: xorps %xmm1, %xmm1
+; CHECK-NEXT: ucomiss %xmm1, %xmm0
+; CHECK-NEXT: jb .LBB0_2
 ; CHECK-NEXT: # BB#1: # %.split
-; CHECK-NEXT: movaps %xmm1, %xmm0
+; CHECK-NEXT: sqrtss %xmm0, %xmm0
 ; CHECK-NEXT: retq
 ; CHECK-NEXT: .LBB0_2: # %call.sqrt
 ; CHECK-NEXT: jmp sqrtf # TAILCALL
   %res = tail call float @sqrtf(float %val)
   ret float %res
 }

 define double @d(double %val) nounwind {
 ; CHECK-LABEL: d:
 ; CHECK: # BB#0:
-; CHECK-NEXT: sqrtsd %xmm0, %xmm1
-; CHECK-NEXT: ucomisd %xmm1, %xmm1
-; CHECK-NEXT: jp .LBB1_2
+; CHECK-NEXT: xorps %xmm1, %xmm1
+; CHECK-NEXT: ucomisd %xmm1, %xmm0
+; CHECK-NEXT: jb .LBB1_2
 ; CHECK-NEXT: # BB#1: # %.split
-; CHECK-NEXT: movapd %xmm1, %xmm0
+; CHECK-NEXT: sqrtsd %xmm0, %xmm0
 ; CHECK-NEXT: retq
 ; CHECK-NEXT: .LBB1_2: # %call.sqrt
 ; CHECK-NEXT: jmp sqrt # TAILCALL
   %res = tail call double @sqrt(double %val)
   ret double %res
 }

 declare float @sqrtf(float)
 declare double @sqrt(double)

test/Transforms/PartiallyInlineLibCalls/X86/good-prototype.ll

+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
 ; RUN: opt -S -partially-inline-libcalls -mtriple=x86_64-unknown-linux-gnu < %s | FileCheck %s
 ; RUN: opt -S -passes=partially-inline-libcalls -mtriple=x86_64-unknown-linux-gnu < %s | FileCheck %s

 define float @f(float %val) {
-; CHECK: @f
-; CHECK: entry:
-; CHECK-NEXT: %[[RES:.+]] = tail call float @sqrtf(float %val) #0
-; CHECK-NEXT: %[[CMP:.+]] = fcmp oeq float %[[RES]], %[[RES]]
-; CHECK-NEXT: br i1 %[[CMP]], label %[[EXIT:.+]], label %[[CALL:.+]]
-; CHECK: [[CALL]]:
-; CHECK-NEXT: %[[RES2:.+]] = tail call float @sqrtf(float %val){{$}}
-; CHECK-NEXT: br label %[[EXIT]]
-; CHECK: [[EXIT]]:
-; CHECK-NEXT: %[[RET:.+]] = phi float [ %[[RES]], %entry ], [ %[[RES2]], %[[CALL]] ]
-; CHECK-NEXT: ret float %[[RET]]
+; CHECK-LABEL: @f(
+; CHECK-NEXT: entry:
+; CHECK-NEXT: [[RES:%.*]] = tail call float @sqrtf(float [[VAL:%.*]]) #0
+; CHECK-NEXT: [[TMP0:%.*]] = fcmp oge float [[VAL]], 0.000000e+00
+; CHECK-NEXT: br i1 [[TMP0]], label [[ENTRY_SPLIT:%.*]], label [[CALL_SQRT:%.*]]
+; CHECK: call.sqrt:
+; CHECK-NEXT: [[TMP1:%.*]] = tail call float @sqrtf(float [[VAL]])
+; CHECK-NEXT: br label [[ENTRY_SPLIT]]
+; CHECK: entry.split:
+; CHECK-NEXT: [[TMP2:%.*]] = phi float [ [[RES]], [[ENTRY:%.*]] ], [ [[TMP1]], [[CALL_SQRT]] ]
+; CHECK-NEXT: ret float [[TMP2]]
+;
 entry:
   %res = tail call float @sqrtf(float %val)
   ret float %res
 }

 declare float @sqrtf(float)
