This is an archive of the discontinued LLVM Phabricator instance.


Differential D79418

[LSR][ARM] Add new TTI hook to mark some LSR chains as profitable
ClosedPublic

Authored by Pierre-vh on May 5 2020, 7:06 AM.


Details

Reviewers

samparker
dmgreen
SjoerdMeijer

Commits

rG2668775f6665: [LSR][ARM] Add new TTI hook to mark some LSR chains as profitable

Summary

Sometimes, LSR may change the way the ARM VCTP intrinsic's operand is calculated, and the change, while correct, will completely block tail-predication, for instance by defining the operand inside the loop.
This patch aims to fix this issue by adding a new TTI hook to tell LSR to ignore some instructions (for now, only the VCTP intrinsic on ARM).

This patch adds:

- A new TTI hook: bool canLSRFixupInstruction(Instruction *I), which returns false when LSR shouldn't change I's operands.
- A new function in LSR called FilterOutUndesirableUses, which calls this new TTI hook on every LSRUse's LSRFixup UserInst, and deletes the LSRUse if the hook returns false for one of the instructions.
- An impl of this TTI hook for ARM, which returns false for VCTP intrinsics.

Note that I'm unsure about these changes. Do you feel like this is an appropriate fix for this issue?
Allowing LSR to do its thing and fixing it ourselves later in a backend pass is tricky and fragile, so I personally feel that fixing the problem in LSR directly is the best course of action.
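
For illustration only, the filtering step proposed in this first version of the patch could be sketched roughly as below. This is a minimal standalone model, not the code from the diff: `Instruction`, `LSRFixup` and `LSRUse` are hypothetical stand-ins for the LLVM classes of the same names, and the `IsVCTP` flag is an assumption that replaces the real intrinsic check.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical stand-ins for LLVM's types; only the fields needed here.
struct Instruction {
  bool IsVCTP; // assumption: marks a VCTP-like intrinsic call
};

struct LSRFixup {
  const Instruction *UserInst;
};

struct LSRUse {
  std::vector<LSRFixup> Fixups;
};

// Toy version of the proposed hook: returning false means
// "LSR should leave this instruction's operands alone".
bool canLSRFixupInstruction(const Instruction *I) { return !I->IsVCTP; }

// Delete every use that has at least one fixup rejected by the hook.
void filterOutUndesirableUses(std::vector<LSRUse> &Uses) {
  auto Rejected = [](const LSRUse &U) {
    return std::any_of(U.Fixups.begin(), U.Fixups.end(),
                       [](const LSRFixup &F) {
                         return !canLSRFixupInstruction(F.UserInst);
                       });
  };
  Uses.erase(std::remove_if(Uses.begin(), Uses.end(), Rejected), Uses.end());
}
```

In this model a use is dropped as a whole as soon as one of its fixups touches a rejected instruction, which matches the "deletes the LSRUse" wording of the summary.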

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

Pierre-vh created this revision. May 5 2020, 7:06 AM

Herald added a project: Restricted Project. May 5 2020, 7:06 AM

Herald added subscribers: llvm-commits, danielkiss, hiraditya, kristof.beyls.

Harbormaster failed remote builds in B55788: Diff 262102! May 5 2020, 8:03 AM

samparker added inline comments. May 5 2020, 11:32 PM

llvm/lib/Transforms/Scalar/LoopStrengthReduce.cpp
5587	How about doing the filtering straight away, even before 'OptimizeShadowIV'? Can we just remove the instruction before we generate any formulae?

Changing the implementation of the patch.

- TTI hook renamed to isProfitableLSRChainElement
  - It now returns true for the VCTP
- Removed FilterOutUndesirableUses
- Now isProfitableLSRChainElement is called in LSR's isProfitableChain function. If it returns true for one of the chain's UserInst, the chain will be considered profitable and will not be optimized by LSR.
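
As a rough sketch of the revised approach described above, the early exit in the profitability check could look like the following. The types are hypothetical stand-ins rather than LLVM's real `IVChain` and `TargetTransformInfo`, the toy loop visits every element of `Incs` including the head, and the `HeuristicSaysProfitable` parameter is an assumption that stands in for the existing cost computation.

```cpp
#include <vector>

// Hypothetical stand-ins; LLVM's real classes carry much more state.
struct Instruction {
  bool IsVCTP; // assumption: marks a VCTP-like intrinsic call
};

struct IVInc {
  const Instruction *UserInst;
};

struct IVChain {
  std::vector<IVInc> Incs;
};

struct ToyTTI {
  // Toy target hook: claims chains that contain a VCTP-like instruction.
  bool isProfitableLSRChainElement(const Instruction *I) const {
    return I->IsVCTP;
  }
};

// Return true as soon as the target claims any element of the chain;
// otherwise defer to the (stubbed) register-cost heuristic.
bool isProfitableChain(const IVChain &Chain, const ToyTTI &TTI,
                       bool HeuristicSaysProfitable) {
  for (const IVInc &Inc : Chain.Incs)
    if (TTI.isProfitableLSRChainElement(Inc.UserInst))
      return true;
  return HeuristicSaysProfitable;
}
```

The design point is that the target hook only ever forces a chain to be kept; when it returns false for every element, the usual heuristic still decides.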

Looks good. Would you mind also adding a codegen test so we can see this going through the whole pipeline, successfully being tail predicated?

llvm/lib/Transforms/Scalar/LoopStrengthReduce.cpp
2856	Is this necessary? If this is executed in the loop then I don't see the worth of optimising the first iteration.

Simplified the vctp-chains.ll test: removed useless attributes
Added codegen test to test the whole pipeline (based on the vctpi32 test from vctp-chains.ll)

llvm/lib/Transforms/Scalar/LoopStrengthReduce.cpp

2856

It's not executed in the loop below, that one skips the first element, so the first element has to be handled separately.

See IVChain's implementation:

struct IVChain {
  SmallVector<IVInc, 1> Incs;
  
   /* ... */

  using const_iterator = SmallVectorImpl<IVInc>::const_iterator;

  // Return the first increment in the chain.
  const_iterator begin() const {
    assert(!Incs.empty());
    return std::next(Incs.begin());
  }
  const_iterator end() const {
    return Incs.end();
  }

  /* ... */
};
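
The point made in this reply can be checked with a small standalone model. The sketch below is an assumption-laden simplification: `ToyChain` is a hypothetical type that uses `std::vector<int>` in place of `SmallVector<IVInc, 1>`, and `visited` is a helper invented for the example. It shows that a range-based for loop over a type whose `begin()` returns `std::next(Incs.begin())` never visits the head element.

```cpp
#include <cassert>
#include <iterator>
#include <vector>

// Simplified stand-in for IVChain: ints replace IVInc entries.
struct ToyChain {
  std::vector<int> Incs;

  using const_iterator = std::vector<int>::const_iterator;

  // Mirrors the quoted begin(): iteration starts at the second element.
  const_iterator begin() const {
    assert(!Incs.empty());
    return std::next(Incs.begin());
  }
  const_iterator end() const { return Incs.end(); }
};

// Collect the elements a range-based for loop over the chain visits.
std::vector<int> visited(const ToyChain &C) {
  std::vector<int> Out;
  for (int V : C)
    Out.push_back(V);
  return Out;
}
```

With `Incs` holding 10, 20, 30, the loop only sees 20 and 30, which is why the patch checks `Chain.Incs[0].UserInst` separately before the loop.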

LGTM, cheers.

This revision is now accepted and ready to land. May 12 2020, 11:39 PM

Closed by commit rG2668775f6665: [LSR][ARM] Add new TTI hook to mark some LSR chains as profitable (authored by Pierre-vh). May 13 2020, 6:27 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path	Size
llvm/include/llvm/Analysis/TargetTransformInfo.h	7 lines
llvm/include/llvm/Analysis/TargetTransformInfoImpl.h	2 lines
llvm/include/llvm/CodeGen/BasicTTIImpl.h	4 lines
llvm/lib/Analysis/TargetTransformInfo.cpp	4 lines
llvm/lib/Target/ARM/ARMTargetTransformInfo.h	2 lines
llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp	18 lines
llvm/lib/Transforms/Scalar/LoopStrengthReduce.cpp	16 lines
llvm/test/CodeGen/Thumb2/LowOverheadLoops/lsr-profitable-chain.ll	69 lines
llvm/test/Transforms/LoopStrengthReduce/ARM/vctp-chains.ll	257 lines

Diff 263692

llvm/include/llvm/Analysis/TargetTransformInfo.h

@@ [513 lines not shown] @@ bool isLegalAddressingMode(Type *Ty, GlobalValue *BaseGV, int64_t BaseOffset,
                              bool HasBaseReg, int64_t Scale,
                              unsigned AddrSpace = 0,
                              Instruction *I = nullptr) const;

   /// Return true if LSR cost of C1 is lower than C1.
   bool isLSRCostLess(TargetTransformInfo::LSRCost &C1,
                      TargetTransformInfo::LSRCost &C2) const;

+  /// \returns true if LSR should not optimize a chain that includes \p I.
+  bool isProfitableLSRChainElement(Instruction *I) const;
+
   /// Return true if the target can fuse a compare and branch.
   /// Loop-strength-reduction (LSR) uses that knowledge to adjust its cost
   /// calculation for the instructions in a loop.
   bool canMacroFuseCmp() const;

   /// Return true if the target can save a compare for loop count, for example
   /// hardware loop saves a compare.
   bool canSaveCmp(Loop *L, BranchInst **BI, ScalarEvolution *SE, LoopInfo *LI,
@@ [698 lines not shown] @@ public:
   virtual bool isLegalAddImmediate(int64_t Imm) = 0;
   virtual bool isLegalICmpImmediate(int64_t Imm) = 0;
   virtual bool isLegalAddressingMode(Type *Ty, GlobalValue *BaseGV,
                                      int64_t BaseOffset, bool HasBaseReg,
                                      int64_t Scale, unsigned AddrSpace,
                                      Instruction *I) = 0;
   virtual bool isLSRCostLess(TargetTransformInfo::LSRCost &C1,
                              TargetTransformInfo::LSRCost &C2) = 0;
+  virtual bool isProfitableLSRChainElement(Instruction *I) = 0;
   virtual bool canMacroFuseCmp() = 0;
   virtual bool canSaveCmp(Loop *L, BranchInst **BI, ScalarEvolution *SE,
                           LoopInfo *LI, DominatorTree *DT, AssumptionCache *AC,
                           TargetLibraryInfo *LibInfo) = 0;
   virtual bool shouldFavorPostInc() const = 0;
   virtual bool shouldFavorBackedgeIndex(const Loop *L) const = 0;
   virtual bool isLegalMaskedStore(Type *DataType, MaybeAlign Alignment) = 0;
   virtual bool isLegalMaskedLoad(Type *DataType, MaybeAlign Alignment) = 0;
@@ [293 lines not shown] @@ bool isLegalAddressingMode(Type *Ty, GlobalValue *BaseGV, int64_t BaseOffset,
                              Instruction *I) override {
     return Impl.isLegalAddressingMode(Ty, BaseGV, BaseOffset, HasBaseReg, Scale,
                                       AddrSpace, I);
   }
   bool isLSRCostLess(TargetTransformInfo::LSRCost &C1,
                      TargetTransformInfo::LSRCost &C2) override {
     return Impl.isLSRCostLess(C1, C2);
   }
+  bool isProfitableLSRChainElement(Instruction *I) override {
+    return Impl.isProfitableLSRChainElement(I);
+  }
   bool canMacroFuseCmp() override { return Impl.canMacroFuseCmp(); }
   bool canSaveCmp(Loop *L, BranchInst **BI, ScalarEvolution *SE, LoopInfo *LI,
                   DominatorTree *DT, AssumptionCache *AC,
                   TargetLibraryInfo *LibInfo) override {
     return Impl.canSaveCmp(L, BI, SE, LI, DT, AC, LibInfo);
   }
   bool shouldFavorPostInc() const override { return Impl.shouldFavorPostInc(); }
   bool shouldFavorBackedgeIndex(const Loop *L) const override {
@@ [493 lines not shown] @@

llvm/include/llvm/Analysis/TargetTransformInfoImpl.h

@@ [160 lines not shown] @@ public:

   bool isLSRCostLess(TTI::LSRCost &C1, TTI::LSRCost &C2) {
     return std::tie(C1.NumRegs, C1.AddRecCost, C1.NumIVMuls, C1.NumBaseAdds,
                     C1.ScaleCost, C1.ImmCost, C1.SetupCost) <
            std::tie(C2.NumRegs, C2.AddRecCost, C2.NumIVMuls, C2.NumBaseAdds,
                     C2.ScaleCost, C2.ImmCost, C2.SetupCost);
   }

+  bool isProfitableLSRChainElement(Instruction *I) { return false; }
+
   bool canMacroFuseCmp() { return false; }

   bool canSaveCmp(Loop *L, BranchInst **BI, ScalarEvolution *SE, LoopInfo *LI,
                   DominatorTree *DT, AssumptionCache *AC,
                   TargetLibraryInfo *LibInfo) {
     return false;
   }

@@ [748 lines not shown] @@

llvm/include/llvm/CodeGen/BasicTTIImpl.h

@@ [256 lines not shown] @@ bool isIndexedStoreLegal(TTI::MemIndexedMode M, Type *Ty,
     EVT VT = getTLI()->getValueType(DL, Ty);
     return getTLI()->isIndexedStoreLegal(getISDIndexedMode(M), VT);
   }

   bool isLSRCostLess(TTI::LSRCost C1, TTI::LSRCost C2) {
     return TargetTransformInfoImplBase::isLSRCostLess(C1, C2);
   }

+  bool isProfitableLSRChainElement(Instruction *I) {
+    return TargetTransformInfoImplBase::isProfitableLSRChainElement(I);
+  }
+
   int getScalingFactorCost(Type *Ty, GlobalValue *BaseGV, int64_t BaseOffset,
                            bool HasBaseReg, int64_t Scale, unsigned AddrSpace) {
     TargetLoweringBase::AddrMode AM;
     AM.BaseGV = BaseGV;
     AM.BaseOffs = BaseOffset;
     AM.HasBaseReg = HasBaseReg;
     AM.Scale = Scale;
     return getTLI()->getScalingFactorCost(DL, AM, Ty, AddrSpace);
@@ [1,549 lines not shown] @@

llvm/lib/Analysis/TargetTransformInfo.cpp

@@ [255 lines not shown] @@ bool TargetTransformInfo::isLegalAddressingMode(Type *Ty, GlobalValue *BaseGV,
   return TTIImpl->isLegalAddressingMode(Ty, BaseGV, BaseOffset, HasBaseReg,
                                         Scale, AddrSpace, I);
 }

 bool TargetTransformInfo::isLSRCostLess(LSRCost &C1, LSRCost &C2) const {
   return TTIImpl->isLSRCostLess(C1, C2);
 }

+bool TargetTransformInfo::isProfitableLSRChainElement(Instruction *I) const {
+  return TTIImpl->isProfitableLSRChainElement(I);
+}
+
 bool TargetTransformInfo::canMacroFuseCmp() const {
   return TTIImpl->canMacroFuseCmp();
 }

 bool TargetTransformInfo::canSaveCmp(Loop *L, BranchInst **BI,
                                      ScalarEvolution *SE, LoopInfo *LI,
                                      DominatorTree *DT, AssumptionCache *AC,
                                      TargetLibraryInfo *LibInfo) const {
@@ [1,156 lines not shown] @@

llvm/lib/Target/ARM/ARMTargetTransformInfo.h

@@ [145 lines not shown] @@ unsigned getRegisterBitWidth(bool Vector) const {

     return 32;
   }

   unsigned getMaxInterleaveFactor(unsigned VF) {
     return ST->getMaxInterleaveFactor();
   }

+  bool isProfitableLSRChainElement(Instruction *I);
+
   bool isLegalMaskedLoad(Type *DataTy, MaybeAlign Alignment);

   bool isLegalMaskedStore(Type *DataTy, MaybeAlign Alignment) {
     return isLegalMaskedLoad(DataTy, Alignment);
   }

   bool isLegalMaskedGather(Type *Ty, MaybeAlign Alignment);

@@ [104 lines not shown] @@

llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp

@@ [15 lines not shown] @@
 #include "llvm/CodeGen/ISDOpcodes.h"
 #include "llvm/CodeGen/ValueTypes.h"
 #include "llvm/IR/BasicBlock.h"
 #include "llvm/IR/DataLayout.h"
 #include "llvm/IR/DerivedTypes.h"
 #include "llvm/IR/Instruction.h"
 #include "llvm/IR/Instructions.h"
 #include "llvm/IR/IntrinsicInst.h"
+#include "llvm/IR/IntrinsicsARM.h"
 #include "llvm/IR/PatternMatch.h"
 #include "llvm/IR/Type.h"
 #include "llvm/MC/SubtargetFeature.h"
 #include "llvm/Support/Casting.h"
 #include "llvm/Support/MachineValueType.h"
 #include "llvm/Target/TargetMachine.h"
 #include <algorithm>
 #include <cassert>
@@ [513 lines not shown] @@ if (ST->hasNEON()) {

     // In many cases the address computation is not merged into the instruction
     // addressing mode.
     return 1;
   }
   return BaseT::getAddressComputationCost(Ty, SE, Ptr);
 }

+bool ARMTTIImpl::isProfitableLSRChainElement(Instruction *I) {
+  if (IntrinsicInst *II = dyn_cast<IntrinsicInst>(I)) {
+    // If a VCTP is part of a chain, it's already profitable and shouldn't be
+    // optimized, else LSR may block tail-predication.
+    switch (II->getIntrinsicID()) {
+    case Intrinsic::arm_mve_vctp8:
+    case Intrinsic::arm_mve_vctp16:
+    case Intrinsic::arm_mve_vctp32:
+    case Intrinsic::arm_mve_vctp64:
+      return true;
+    default:
+      break;
+    }
+  }
+  return false;
+}
+
 bool ARMTTIImpl::isLegalMaskedLoad(Type *DataTy, MaybeAlign Alignment) {
   if (!EnableMaskedLoadStores || !ST->hasMVEIntegerOps())
     return false;

   if (auto *VecTy = dyn_cast<VectorType>(DataTy)) {
     // Don't support v2i1 yet.
     if (VecTy->getNumElements() == 2)
       return false;
@@ [906 lines not shown] @@

llvm/lib/Transforms/Scalar/LoopStrengthReduce.cpp

@@ [2,814 lines not shown] @@
 /// any IV users that keep the IV live across increments (the Users set should
 /// be empty). Next count the number and type of increments in the chain.
 ///
 /// Chaining IVs can lead to considerable code bloat if ISEL doesn't
 /// effectively use postinc addressing modes. Only consider it profitable it the
 /// increments can be computed in fewer registers when chained.
 ///
 /// TODO: Consider IVInc free if it's already used in another chains.
-static bool
-isProfitableChain(IVChain &Chain, SmallPtrSetImpl<Instruction*> &Users,
-                  ScalarEvolution &SE) {
+static bool isProfitableChain(IVChain &Chain,
+                              SmallPtrSetImpl<Instruction *> &Users,
+                              ScalarEvolution &SE,
+                              const TargetTransformInfo &TTI) {
   if (StressIVChain)
     return true;

   if (!Chain.hasIncs())
     return false;

   if (!Users.empty()) {
     LLVM_DEBUG(dbgs() << "Chain: " << *Chain.Incs[0].UserInst << " users:\n";
@@ [12 lines not shown] @@ static bool isProfitableChain(IVChain &Chain,
   if (isa<PHINode>(Chain.tailUserInst())
       && SE.getSCEV(Chain.tailUserInst()) == Chain.Incs[0].IncExpr) {
     --cost;
   }
   const SCEV *LastIncExpr = nullptr;
   unsigned NumConstIncrements = 0;
   unsigned NumVarIncrements = 0;
   unsigned NumReusedIncrements = 0;

+  if (TTI.isProfitableLSRChainElement(Chain.Incs[0].UserInst))
+    return true;
   [Inline comment on the `if` line above, samparker (Done): Is this necessary? If this is executed in the loop then I don't see the worth of optimising the first iteration.]
   [Inline reply, Pierre-vh (Done): It's not executed in the loop below, that one skips the first element, so the first element has to be handled separately. See IVChain's implementation, quoted in the event timeline.]
+
   for (const IVInc &Inc : Chain) {
+    if (TTI.isProfitableLSRChainElement(Inc.UserInst))
+      return true;
+
     if (Inc.IncExpr->isZero())
       continue;

     // Incrementing by zero or some constant is neutral. We assume constants can
     // be folded into an addressing mode or an add's immediate operand.
     if (isa<SCEVConstant>(Inc.IncExpr)) {
       ++NumConstIncrements;
       continue;
@@ [214 lines not shown] @@ for (PHINode &PN : L->getHeader()->phis()) {
     if (IncV)
       ChainInstruction(&PN, IncV, ChainUsersVec);
   }
   // Remove any unprofitable chains.
   unsigned ChainIdx = 0;
   for (unsigned UsersIdx = 0, NChains = IVChainVec.size();
        UsersIdx < NChains; ++UsersIdx) {
     if (!isProfitableChain(IVChainVec[UsersIdx],
-                           ChainUsersVec[UsersIdx].FarUsers, SE))
+                           ChainUsersVec[UsersIdx].FarUsers, SE, TTI))
       continue;
     // Preserve the chain at UsesIdx.
     if (ChainIdx != UsersIdx)
       IVChainVec[ChainIdx] = IVChainVec[UsersIdx];
     FinalizeChain(IVChainVec[ChainIdx]);
     ++ChainIdx;
   }
   IVChainVec.resize(ChainIdx);
@@ [2,477 lines not shown] @@ #endif // DEBUG

   LLVM_DEBUG(dbgs() << "LSR found " << Uses.size() << " uses:\n";
              print_uses(dbgs()));

   // Now use the reuse data to generate a bunch of interesting ways
   // to formulate the values needed for the uses.
   GenerateAllReuseFormulae();

   FilterOutUndesirableDedicatedRegisters();
   [Inline comment on an earlier version of the diff, samparker (Not Done): How about doing the filtering straight away, even before 'OptimizeShadowIV'? Can we just remove the instruction before we generate any formulae?]
   NarrowSearchSpaceUsingHeuristics();

   SmallVector<const Formula *, 8> Solution;
   Solve(Solution);

   // Release memory that is no longer needed.
   Factors.clear();
   Types.clear();
@@ [199 lines not shown] @@

llvm/test/CodeGen/Thumb2/LowOverheadLoops/lsr-profitable-chain.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc -O3 -disable-mve-tail-predication=false -mtriple=thumbv8.1m.main -mattr=+mve,+mve.fp %s -o - | FileCheck %s

				target datalayout = "e-m:e-p:32:32-i64:64-v128:64:128-a:0:32-n32-S64"
				target triple = "thumbv8.1m-arm-none-eabi"

				; Tests that LSR will not interfere with the VCTP intrinsic,
				; and that this loop will correctly become tail-predicated.

				define arm_aapcs_vfpcc float @vctpi32(float* %0, i32 %1) {
				; CHECK-LABEL: vctpi32:
				; CHECK: @ %bb.0:
				; CHECK-NEXT: push {r7, lr}
				; CHECK-NEXT: vmvn.i32 q1, #0x1f
				; CHECK-NEXT: vmov.32 q3[0], r0
				; CHECK-NEXT: movs r2, #0
				; CHECK-NEXT: vadd.i32 q1, q3, q1
				; CHECK-NEXT: subs r3, r1, #1
				; CHECK-NEXT: vidup.u32 q2, r2, #8
				; CHECK-NEXT: vmov r0, s4
				; CHECK-NEXT: vadd.i32 q1, q2, r0
				; CHECK-NEXT: vmov.i32 q0, #0x0
				; CHECK-NEXT: dlstp.32 lr, r3
				; CHECK-NEXT: .LBB0_1: @ =>This Inner Loop Header: Depth=1
				; CHECK-NEXT: vldrw.u32 q2, [q1, #32]!
				; CHECK-NEXT: vadd.f32 q0, q0, q2
				; CHECK-NEXT: letp lr, .LBB0_1
				; CHECK-NEXT: @ %bb.2:
				; CHECK-NEXT: bl vecAddAcrossF32Mve
				; CHECK-NEXT: vmov s0, r0
				; CHECK-NEXT: vcvt.f32.s32 s0, s0
				; CHECK-NEXT: vabs.f32 s0, s0
				; CHECK-NEXT: pop {r7, pc}
				%3 = tail call { <4 x i32>, i32 } @llvm.arm.mve.vidup.v4i32(i32 0, i32 8)
				%4 = extractvalue { <4 x i32>, i32 } %3, 0
				%5 = add nsw i32 %1, -1
				%6 = ptrtoint float* %0 to i32
				%7 = insertelement <4 x i32> undef, i32 %6, i32 0
				%8 = add <4 x i32> %7, <i32 -32, i32 undef, i32 undef, i32 undef>
				%9 = shufflevector <4 x i32> %8, <4 x i32> undef, <4 x i32> zeroinitializer
				%10 = add <4 x i32> %4, %9
				br label %11

				11:
				%12 = phi i32 [ %5, %2 ], [ %20, %11 ]
				%13 = phi <4 x float> [ zeroinitializer, %2 ], [ %19, %11 ]
				%14 = phi <4 x i32> [ %10, %2 ], [ %17, %11 ]
				%15 = tail call <4 x i1> @llvm.arm.mve.vctp32(i32 %12)
				%16 = tail call { <4 x float>, <4 x i32> } @llvm.arm.mve.vldr.gather.base.wb.predicated.v4f32.v4i32.v4i1(<4 x i32> %14, i32 32, <4 x i1> %15)
				%17 = extractvalue { <4 x float>, <4 x i32> } %16, 1
				%18 = extractvalue { <4 x float>, <4 x i32> } %16, 0
				%19 = tail call <4 x float> @llvm.arm.mve.add.predicated.v4f32.v4i1(<4 x float> %13, <4 x float> %18, <4 x i1> %15, <4 x float> %13)
				%20 = add nsw i32 %12, -4
				%21 = icmp sgt i32 %12, 4
				br i1 %21, label %11, label %22

				22:
				%23 = tail call arm_aapcs_vfpcc i32 bitcast (i32 (...)* @vecAddAcrossF32Mve to i32 (<4 x float>)*)(<4 x float> %19)
				%24 = sitofp i32 %23 to float
				%25 = tail call float @llvm.fabs.f32(float %24)
				ret float %25
				}

				declare { <4 x i32>, i32 } @llvm.arm.mve.vidup.v4i32(i32, i32)
				declare <4 x i1> @llvm.arm.mve.vctp32(i32)
				declare { <4 x float>, <4 x i32> } @llvm.arm.mve.vldr.gather.base.wb.predicated.v4f32.v4i32.v4i1(<4 x i32>, i32, <4 x i1>)
				declare <4 x float> @llvm.arm.mve.add.predicated.v4f32.v4i1(<4 x float>, <4 x float>, <4 x i1>, <4 x float>)
				declare arm_aapcs_vfpcc i32 @vecAddAcrossF32Mve(...)
				declare float @llvm.fabs.f32(float)

llvm/test/Transforms/LoopStrengthReduce/ARM/vctp-chains.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
				; RUN: opt -mtriple=thumbv8.1m.main -mattr=+mve %s -S -loop-reduce -o - | FileCheck %s
				target datalayout = "e-m:e-p:32:32-i64:64-v128:64:128-a:0:32-n32-S64"
				target triple = "thumbv8.1m-arm-none-eabi"

				define float @vctp8(float* %0, i32 %1) {
				; CHECK-LABEL: @vctp8(
				; CHECK-NEXT: [[TMP3:%.*]] = tail call { <4 x i32>, i32 } @llvm.arm.mve.vidup.v4i32(i32 0, i32 8)
				; CHECK-NEXT: [[TMP4:%.*]] = extractvalue { <4 x i32>, i32 } [[TMP3]], 0
				; CHECK-NEXT: [[TMP5:%.*]] = add nsw i32 [[TMP1:%.*]], -1
				; CHECK-NEXT: [[TMP6:%.*]] = ptrtoint float* [[TMP0:%.*]] to i32
				; CHECK-NEXT: [[TMP7:%.*]] = insertelement <4 x i32> undef, i32 [[TMP6]], i32 0
				; CHECK-NEXT: [[TMP8:%.*]] = add <4 x i32> [[TMP7]], <i32 -32, i32 undef, i32 undef, i32 undef>
				; CHECK-NEXT: [[TMP9:%.*]] = shufflevector <4 x i32> [[TMP8]], <4 x i32> undef, <4 x i32> zeroinitializer
				; CHECK-NEXT: [[TMP10:%.*]] = add <4 x i32> [[TMP4]], [[TMP9]]
				; CHECK-NEXT: br label [[TMP11:%.*]]
				; CHECK: 11:
				; CHECK-NEXT: [[TMP12:%.*]] = phi i32 [ [[TMP5]], [[TMP2:%.*]] ], [ [[TMP21:%.*]], [[TMP11]] ]
				; CHECK-NEXT: [[TMP13:%.*]] = phi <4 x float> [ zeroinitializer, [[TMP2]] ], [ [[TMP19:%.*]], [[TMP11]] ]
				; CHECK-NEXT: [[TMP14:%.*]] = phi <4 x i32> [ [[TMP10]], [[TMP2]] ], [ [[TMP17:%.*]], [[TMP11]] ]
				; CHECK-NEXT: [[TMP15:%.*]] = tail call <16 x i1> @llvm.arm.mve.vctp8(i32 [[TMP12]])
				; CHECK-NEXT: [[MASK:%.*]] = tail call <4 x i1> @v16i1_to_v4i1(<16 x i1> [[TMP15]])
				; CHECK-NEXT: [[TMP16:%.*]] = tail call { <4 x float>, <4 x i32> } @llvm.arm.mve.vldr.gather.base.wb.predicated.v4f32.v4i32.v4i1(<4 x i32> [[TMP14]], i32 32, <4 x i1> [[MASK]])
				; CHECK-NEXT: [[TMP17]] = extractvalue { <4 x float>, <4 x i32> } [[TMP16]], 1
				; CHECK-NEXT: [[TMP18:%.*]] = extractvalue { <4 x float>, <4 x i32> } [[TMP16]], 0
				; CHECK-NEXT: [[TMP19]] = tail call <4 x float> @llvm.arm.mve.add.predicated.v4f32.v4i1(<4 x float> [[TMP13]], <4 x float> [[TMP18]], <4 x i1> [[MASK]], <4 x float> [[TMP13]])
				; CHECK-NEXT: [[TMP20:%.*]] = icmp sgt i32 [[TMP12]], 4
				; CHECK-NEXT: [[TMP21]] = add i32 [[TMP12]], -4
				; CHECK-NEXT: br i1 [[TMP20]], label [[TMP11]], label [[TMP22:%.*]]
				; CHECK: 22:
				; CHECK-NEXT: [[TMP23:%.*]] = tail call i32 bitcast (i32 (...)* @vecAddAcrossF32Mve to i32 (<4 x float>)*)(<4 x float> [[TMP19]])
				; CHECK-NEXT: [[TMP24:%.*]] = sitofp i32 [[TMP23]] to float
				; CHECK-NEXT: [[TMP25:%.*]] = tail call float @llvm.fabs.f32(float [[TMP24]])
				; CHECK-NEXT: ret float [[TMP25]]
				;
				%3 = tail call { <4 x i32>, i32 } @llvm.arm.mve.vidup.v4i32(i32 0, i32 8)
				%4 = extractvalue { <4 x i32>, i32 } %3, 0
				%5 = add nsw i32 %1, -1
				%6 = ptrtoint float* %0 to i32
				%7 = insertelement <4 x i32> undef, i32 %6, i32 0
				%8 = add <4 x i32> %7, <i32 -32, i32 undef, i32 undef, i32 undef>
				%9 = shufflevector <4 x i32> %8, <4 x i32> undef, <4 x i32> zeroinitializer
				%10 = add <4 x i32> %4, %9
				br label %11

				11: ; preds = %11, %2
				%12 = phi i32 [ %5, %2 ], [ %20, %11 ]
				%13 = phi <4 x float> [ zeroinitializer, %2 ], [ %19, %11 ]
				%14 = phi <4 x i32> [ %10, %2 ], [ %17, %11 ]
				%15 = tail call <16 x i1> @llvm.arm.mve.vctp8(i32 %12)
				%mask = tail call <4 x i1> @v16i1_to_v4i1(<16 x i1> %15)
				%16 = tail call { <4 x float>, <4 x i32> } @llvm.arm.mve.vldr.gather.base.wb.predicated.v4f32.v4i32.v4i1(<4 x i32> %14, i32 32, <4 x i1> %mask)
				%17 = extractvalue { <4 x float>, <4 x i32> } %16, 1
				%18 = extractvalue { <4 x float>, <4 x i32> } %16, 0
				%19 = tail call <4 x float> @llvm.arm.mve.add.predicated.v4f32.v4i1(<4 x float> %13, <4 x float> %18, <4 x i1> %mask, <4 x float> %13)
				%20 = add nsw i32 %12, -4
				%21 = icmp sgt i32 %12, 4
				br i1 %21, label %11, label %22

				22: ; preds = %11
				%23 = tail call i32 bitcast (i32 (...)* @vecAddAcrossF32Mve to i32 (<4 x float>)*)(<4 x float> %19)
				%24 = sitofp i32 %23 to float
				%25 = tail call float @llvm.fabs.f32(float %24)
				ret float %25
				}

				define float @vctp16(float* %0, i32 %1) {
				; CHECK-LABEL: @vctp16(
				; CHECK-NEXT: [[TMP3:%.*]] = tail call { <4 x i32>, i32 } @llvm.arm.mve.vidup.v4i32(i32 0, i32 8)
				; CHECK-NEXT: [[TMP4:%.*]] = extractvalue { <4 x i32>, i32 } [[TMP3]], 0
				; CHECK-NEXT: [[TMP5:%.*]] = add nsw i32 [[TMP1:%.*]], -1
				; CHECK-NEXT: [[TMP6:%.*]] = ptrtoint float* [[TMP0:%.*]] to i32
				; CHECK-NEXT: [[TMP7:%.*]] = insertelement <4 x i32> undef, i32 [[TMP6]], i32 0
				; CHECK-NEXT: [[TMP8:%.*]] = add <4 x i32> [[TMP7]], <i32 -32, i32 undef, i32 undef, i32 undef>
				; CHECK-NEXT: [[TMP9:%.*]] = shufflevector <4 x i32> [[TMP8]], <4 x i32> undef, <4 x i32> zeroinitializer
				; CHECK-NEXT: [[TMP10:%.*]] = add <4 x i32> [[TMP4]], [[TMP9]]
				; CHECK-NEXT: br label [[TMP11:%.*]]
				; CHECK: 11:
				; CHECK-NEXT: [[TMP12:%.*]] = phi i32 [ [[TMP5]], [[TMP2:%.*]] ], [ [[TMP21:%.*]], [[TMP11]] ]
				; CHECK-NEXT: [[TMP13:%.*]] = phi <4 x float> [ zeroinitializer, [[TMP2]] ], [ [[TMP19:%.*]], [[TMP11]] ]
				; CHECK-NEXT: [[TMP14:%.*]] = phi <4 x i32> [ [[TMP10]], [[TMP2]] ], [ [[TMP17:%.*]], [[TMP11]] ]
				; CHECK-NEXT: [[TMP15:%.*]] = tail call <8 x i1> @llvm.arm.mve.vctp16(i32 [[TMP12]])
				; CHECK-NEXT: [[MASK:%.*]] = tail call <4 x i1> @v8i1_to_v4i1(<8 x i1> [[TMP15]])
				; CHECK-NEXT: [[TMP16:%.*]] = tail call { <4 x float>, <4 x i32> } @llvm.arm.mve.vldr.gather.base.wb.predicated.v4f32.v4i32.v4i1(<4 x i32> [[TMP14]], i32 32, <4 x i1> [[MASK]])
				; CHECK-NEXT: [[TMP17]] = extractvalue { <4 x float>, <4 x i32> } [[TMP16]], 1
				; CHECK-NEXT: [[TMP18:%.*]] = extractvalue { <4 x float>, <4 x i32> } [[TMP16]], 0
				; CHECK-NEXT: [[TMP19]] = tail call <4 x float> @llvm.arm.mve.add.predicated.v4f32.v4i1(<4 x float> [[TMP13]], <4 x float> [[TMP18]], <4 x i1> [[MASK]], <4 x float> [[TMP13]])
				; CHECK-NEXT: [[TMP20:%.*]] = icmp sgt i32 [[TMP12]], 4
				; CHECK-NEXT: [[TMP21]] = add i32 [[TMP12]], -4
				; CHECK-NEXT: br i1 [[TMP20]], label [[TMP11]], label [[TMP22:%.*]]
				; CHECK: 22:
				; CHECK-NEXT: [[TMP23:%.*]] = tail call i32 bitcast (i32 (...)* @vecAddAcrossF32Mve to i32 (<4 x float>)*)(<4 x float> [[TMP19]])
				; CHECK-NEXT: [[TMP24:%.*]] = sitofp i32 [[TMP23]] to float
				; CHECK-NEXT: [[TMP25:%.*]] = tail call float @llvm.fabs.f32(float [[TMP24]])
				; CHECK-NEXT: ret float [[TMP25]]
				;
				%3 = tail call { <4 x i32>, i32 } @llvm.arm.mve.vidup.v4i32(i32 0, i32 8)
				%4 = extractvalue { <4 x i32>, i32 } %3, 0
				%5 = add nsw i32 %1, -1
				%6 = ptrtoint float* %0 to i32
				%7 = insertelement <4 x i32> undef, i32 %6, i32 0
				%8 = add <4 x i32> %7, <i32 -32, i32 undef, i32 undef, i32 undef>
				%9 = shufflevector <4 x i32> %8, <4 x i32> undef, <4 x i32> zeroinitializer
				%10 = add <4 x i32> %4, %9
				br label %11

				11: ; preds = %11, %2
				%12 = phi i32 [ %5, %2 ], [ %20, %11 ]
				%13 = phi <4 x float> [ zeroinitializer, %2 ], [ %19, %11 ]
				%14 = phi <4 x i32> [ %10, %2 ], [ %17, %11 ]
				%15 = tail call <8 x i1> @llvm.arm.mve.vctp16(i32 %12)
				%mask = tail call <4 x i1> @v8i1_to_v4i1(<8 x i1> %15)
				%16 = tail call { <4 x float>, <4 x i32> } @llvm.arm.mve.vldr.gather.base.wb.predicated.v4f32.v4i32.v4i1(<4 x i32> %14, i32 32, <4 x i1> %mask)
				%17 = extractvalue { <4 x float>, <4 x i32> } %16, 1
				%18 = extractvalue { <4 x float>, <4 x i32> } %16, 0
				%19 = tail call <4 x float> @llvm.arm.mve.add.predicated.v4f32.v4i1(<4 x float> %13, <4 x float> %18, <4 x i1> %mask, <4 x float> %13)
				%20 = add nsw i32 %12, -4
				%21 = icmp sgt i32 %12, 4
				br i1 %21, label %11, label %22

				22: ; preds = %11
				%23 = tail call i32 bitcast (i32 (...)* @vecAddAcrossF32Mve to i32 (<4 x float>)*)(<4 x float> %19)
				%24 = sitofp i32 %23 to float
				%25 = tail call float @llvm.fabs.f32(float %24)
				ret float %25
				}

				define float @vctpi32(float* %0, i32 %1) {
				; CHECK-LABEL: @vctpi32(
				; CHECK-NEXT: [[TMP3:%.*]] = tail call { <4 x i32>, i32 } @llvm.arm.mve.vidup.v4i32(i32 0, i32 8)
				; CHECK-NEXT: [[TMP4:%.*]] = extractvalue { <4 x i32>, i32 } [[TMP3]], 0
				; CHECK-NEXT: [[TMP5:%.*]] = add nsw i32 [[TMP1:%.*]], -1
				; CHECK-NEXT: [[TMP6:%.*]] = ptrtoint float* [[TMP0:%.*]] to i32
				; CHECK-NEXT: [[TMP7:%.*]] = insertelement <4 x i32> undef, i32 [[TMP6]], i32 0
				; CHECK-NEXT: [[TMP8:%.*]] = add <4 x i32> [[TMP7]], <i32 -32, i32 undef, i32 undef, i32 undef>
				; CHECK-NEXT: [[TMP9:%.*]] = shufflevector <4 x i32> [[TMP8]], <4 x i32> undef, <4 x i32> zeroinitializer
				; CHECK-NEXT: [[TMP10:%.*]] = add <4 x i32> [[TMP4]], [[TMP9]]
				; CHECK-NEXT: br label [[TMP11:%.*]]
				; CHECK: 11:
				; CHECK-NEXT: [[TMP12:%.*]] = phi i32 [ [[TMP5]], [[TMP2:%.*]] ], [ [[TMP21:%.*]], [[TMP11]] ]
				; CHECK-NEXT: [[TMP13:%.*]] = phi <4 x float> [ zeroinitializer, [[TMP2]] ], [ [[TMP19:%.*]], [[TMP11]] ]
				; CHECK-NEXT: [[TMP14:%.*]] = phi <4 x i32> [ [[TMP10]], [[TMP2]] ], [ [[TMP17:%.*]], [[TMP11]] ]
				; CHECK-NEXT: [[TMP15:%.*]] = tail call <4 x i1> @llvm.arm.mve.vctp32(i32 [[TMP12]])
				; CHECK-NEXT: [[TMP16:%.*]] = tail call { <4 x float>, <4 x i32> } @llvm.arm.mve.vldr.gather.base.wb.predicated.v4f32.v4i32.v4i1(<4 x i32> [[TMP14]], i32 32, <4 x i1> [[TMP15]])
				; CHECK-NEXT: [[TMP17]] = extractvalue { <4 x float>, <4 x i32> } [[TMP16]], 1
				; CHECK-NEXT: [[TMP18:%.*]] = extractvalue { <4 x float>, <4 x i32> } [[TMP16]], 0
				; CHECK-NEXT: [[TMP19]] = tail call <4 x float> @llvm.arm.mve.add.predicated.v4f32.v4i1(<4 x float> [[TMP13]], <4 x float> [[TMP18]], <4 x i1> [[TMP15]], <4 x float> [[TMP13]])
				; CHECK-NEXT: [[TMP20:%.*]] = icmp sgt i32 [[TMP12]], 4
				; CHECK-NEXT: [[TMP21]] = add i32 [[TMP12]], -4
				; CHECK-NEXT: br i1 [[TMP20]], label [[TMP11]], label [[TMP22:%.*]]
				; CHECK: 22:
				; CHECK-NEXT: [[TMP23:%.*]] = tail call i32 bitcast (i32 (...)* @vecAddAcrossF32Mve to i32 (<4 x float>)*)(<4 x float> [[TMP19]])
				; CHECK-NEXT: [[TMP24:%.*]] = sitofp i32 [[TMP23]] to float
				; CHECK-NEXT: [[TMP25:%.*]] = tail call float @llvm.fabs.f32(float [[TMP24]])
				; CHECK-NEXT: ret float [[TMP25]]
				;
				%3 = tail call { <4 x i32>, i32 } @llvm.arm.mve.vidup.v4i32(i32 0, i32 8)
				%4 = extractvalue { <4 x i32>, i32 } %3, 0
				%5 = add nsw i32 %1, -1
				%6 = ptrtoint float* %0 to i32
				%7 = insertelement <4 x i32> undef, i32 %6, i32 0
				%8 = add <4 x i32> %7, <i32 -32, i32 undef, i32 undef, i32 undef>
				%9 = shufflevector <4 x i32> %8, <4 x i32> undef, <4 x i32> zeroinitializer
				%10 = add <4 x i32> %4, %9
				br label %11

				11: ; preds = %11, %2
				%12 = phi i32 [ %5, %2 ], [ %20, %11 ]
				%13 = phi <4 x float> [ zeroinitializer, %2 ], [ %19, %11 ]
				%14 = phi <4 x i32> [ %10, %2 ], [ %17, %11 ]
				%15 = tail call <4 x i1> @llvm.arm.mve.vctp32(i32 %12)
				%16 = tail call { <4 x float>, <4 x i32> } @llvm.arm.mve.vldr.gather.base.wb.predicated.v4f32.v4i32.v4i1(<4 x i32> %14, i32 32, <4 x i1> %15)
				%17 = extractvalue { <4 x float>, <4 x i32> } %16, 1
				%18 = extractvalue { <4 x float>, <4 x i32> } %16, 0
				%19 = tail call <4 x float> @llvm.arm.mve.add.predicated.v4f32.v4i1(<4 x float> %13, <4 x float> %18, <4 x i1> %15, <4 x float> %13)
				%20 = add nsw i32 %12, -4
				%21 = icmp sgt i32 %12, 4
				br i1 %21, label %11, label %22

				22: ; preds = %11
				%23 = tail call i32 bitcast (i32 (...)* @vecAddAcrossF32Mve to i32 (<4 x float>)*)(<4 x float> %19)
				%24 = sitofp i32 %23 to float
				%25 = tail call float @llvm.fabs.f32(float %24)
				ret float %25
				}


				define float @vctpi64(float* %0, i32 %1) {
				; CHECK-LABEL: @vctpi64(
				; CHECK-NEXT: [[TMP3:%.*]] = tail call { <4 x i32>, i32 } @llvm.arm.mve.vidup.v4i32(i32 0, i32 8)
				; CHECK-NEXT: [[TMP4:%.*]] = extractvalue { <4 x i32>, i32 } [[TMP3]], 0
				; CHECK-NEXT: [[TMP5:%.*]] = add nsw i32 [[TMP1:%.*]], -1
				; CHECK-NEXT: [[TMP6:%.*]] = ptrtoint float* [[TMP0:%.*]] to i32
				; CHECK-NEXT: [[TMP7:%.*]] = insertelement <4 x i32> undef, i32 [[TMP6]], i32 0
				; CHECK-NEXT: [[TMP8:%.*]] = add <4 x i32> [[TMP7]], <i32 -32, i32 undef, i32 undef, i32 undef>
				; CHECK-NEXT: [[TMP9:%.*]] = shufflevector <4 x i32> [[TMP8]], <4 x i32> undef, <4 x i32> zeroinitializer
				; CHECK-NEXT: [[TMP10:%.*]] = add <4 x i32> [[TMP4]], [[TMP9]]
				; CHECK-NEXT: br label [[TMP11:%.*]]
				; CHECK: 11:
				; CHECK-NEXT: [[TMP12:%.*]] = phi i32 [ [[TMP5]], [[TMP2:%.*]] ], [ [[TMP21:%.*]], [[TMP11]] ]
				; CHECK-NEXT: [[TMP13:%.*]] = phi <4 x float> [ zeroinitializer, [[TMP2]] ], [ [[TMP19:%.*]], [[TMP11]] ]
				; CHECK-NEXT: [[TMP14:%.*]] = phi <4 x i32> [ [[TMP10]], [[TMP2]] ], [ [[TMP17:%.*]], [[TMP11]] ]
				; CHECK-NEXT: [[TMP15:%.*]] = tail call <4 x i1> @llvm.arm.mve.vctp64(i32 [[TMP12]])
				; CHECK-NEXT: [[TMP16:%.*]] = tail call { <4 x float>, <4 x i32> } @llvm.arm.mve.vldr.gather.base.wb.predicated.v4f32.v4i32.v4i1(<4 x i32> [[TMP14]], i32 32, <4 x i1> [[TMP15]])
				; CHECK-NEXT: [[TMP17]] = extractvalue { <4 x float>, <4 x i32> } [[TMP16]], 1
				; CHECK-NEXT: [[TMP18:%.*]] = extractvalue { <4 x float>, <4 x i32> } [[TMP16]], 0
				; CHECK-NEXT: [[TMP19]] = tail call <4 x float> @llvm.arm.mve.add.predicated.v4f32.v4i1(<4 x float> [[TMP13]], <4 x float> [[TMP18]], <4 x i1> [[TMP15]], <4 x float> [[TMP13]])
				; CHECK-NEXT: [[TMP20:%.*]] = icmp sgt i32 [[TMP12]], 4
				; CHECK-NEXT: [[TMP21]] = add i32 [[TMP12]], -4
				; CHECK-NEXT: br i1 [[TMP20]], label [[TMP11]], label [[TMP22:%.*]]
				; CHECK: 22:
				; CHECK-NEXT: [[TMP23:%.*]] = tail call i32 bitcast (i32 (...)* @vecAddAcrossF32Mve to i32 (<4 x float>)*)(<4 x float> [[TMP19]])
				; CHECK-NEXT: [[TMP24:%.*]] = sitofp i32 [[TMP23]] to float
				; CHECK-NEXT: [[TMP25:%.*]] = tail call float @llvm.fabs.f32(float [[TMP24]])
				; CHECK-NEXT: ret float [[TMP25]]
				;
				%3 = tail call { <4 x i32>, i32 } @llvm.arm.mve.vidup.v4i32(i32 0, i32 8)
				%4 = extractvalue { <4 x i32>, i32 } %3, 0
				%5 = add nsw i32 %1, -1
				%6 = ptrtoint float* %0 to i32
				%7 = insertelement <4 x i32> undef, i32 %6, i32 0
				%8 = add <4 x i32> %7, <i32 -32, i32 undef, i32 undef, i32 undef>
				%9 = shufflevector <4 x i32> %8, <4 x i32> undef, <4 x i32> zeroinitializer
				%10 = add <4 x i32> %4, %9
				br label %11

				11: ; preds = %11, %2
				%12 = phi i32 [ %5, %2 ], [ %20, %11 ]
				%13 = phi <4 x float> [ zeroinitializer, %2 ], [ %19, %11 ]
				%14 = phi <4 x i32> [ %10, %2 ], [ %17, %11 ]
				%15 = tail call <4 x i1> @llvm.arm.mve.vctp64(i32 %12)
				%16 = tail call { <4 x float>, <4 x i32> } @llvm.arm.mve.vldr.gather.base.wb.predicated.v4f32.v4i32.v4i1(<4 x i32> %14, i32 32, <4 x i1> %15)
				%17 = extractvalue { <4 x float>, <4 x i32> } %16, 1
				%18 = extractvalue { <4 x float>, <4 x i32> } %16, 0
				%19 = tail call <4 x float> @llvm.arm.mve.add.predicated.v4f32.v4i1(<4 x float> %13, <4 x float> %18, <4 x i1> %15, <4 x float> %13)
				%20 = add nsw i32 %12, -4
				%21 = icmp sgt i32 %12, 4
				br i1 %21, label %11, label %22

				22: ; preds = %11
				%23 = tail call i32 bitcast (i32 (...)* @vecAddAcrossF32Mve to i32 (<4 x float>)*)(<4 x float> %19)
				%24 = sitofp i32 %23 to float
				%25 = tail call float @llvm.fabs.f32(float %24)
				ret float %25
				}

				declare { <4 x i32>, i32 } @llvm.arm.mve.vidup.v4i32(i32, i32)
				declare <16 x i1> @llvm.arm.mve.vctp8(i32)
				declare <8 x i1> @llvm.arm.mve.vctp16(i32)
				declare <4 x i1> @llvm.arm.mve.vctp32(i32)
				declare <4 x i1> @llvm.arm.mve.vctp64(i32)
				declare { <4 x float>, <4 x i32> } @llvm.arm.mve.vldr.gather.base.wb.predicated.v4f32.v4i32.v4i1(<4 x i32>, i32, <4 x i1>)
				declare <4 x float> @llvm.arm.mve.add.predicated.v4f32.v4i1(<4 x float>, <4 x float>, <4 x i1>, <4 x float>)
				declare i32 @vecAddAcrossF32Mve(...)
				declare <4 x i1> @v8i1_to_v4i1(<8 x i1>)
				declare <4 x i1> @v16i1_to_v4i1(<16 x i1>)
				declare float @llvm.fabs.f32(float)

Diff 263692
