This is an archive of the discontinued LLVM Phabricator instance.

[RFC] Intrinsics for Hardware Loops
Abandoned, Public

Authored by samparker on May 20 2019, 3:29 AM.

Details

Reviewers

t.p.northover
efriedma
dmgreen
SjoerdMeijer
sdesmalen
kparzysz
hfinkel

Summary

Arm have recently announced the v8.1-M architecture specification for
our next generation microcontrollers. The architecture includes
vector extensions (MVE) and support for low-overhead branches (LoB),
which can be thought of as a style of hardware loop. Hardware loops
aren't new to LLVM; other backends (at least Hexagon and PPC that I
know of) also include support. These implementations insert the
loop-controlling instructions at the MachineInstr level, and I'd like
to propose that we add intrinsics to support this notion at the IR
level, primarily so that we can use scalar evolution to understand the
loops instead of having to implement a machine-level analysis for
each target.

The attached prototype implementation contains intrinsics that are
currently Arm specific, but I hope they're general enough to be used
by all targets. The Arm v8.1-M architecture supports do-while and
while loops, but for conciseness, here, I'd like to just focus on
while loops. There are two parts to this RFC: (1) the intrinsics
and (2) a prototype implementation in the Arm backend to enable
tail-predicated machine loops.

LLVM IR Intrinsics

In the following definitions, I use the term 'element' to describe
the work performed by an IR loop that has not been vectorized or
unrolled by the compiler. This should be equivalent to the loop at
the source level.

void @llvm.arm.set.loop.iterations(i32)

Takes a single operand: the number of iterations to be executed.

i32 @llvm.arm.set.loop.elements(i32, i32)

Takes two operands:
- The total number of elements to be processed by the loop.
- The maximum number of elements processed in one iteration of the IR loop body.
Returns the number of iterations to be executed.

<X x i1> @llvm.arm.get.active.mask.X(i32)

Takes a single operand: the number of elements that still need processing.
Returns a vector of X i1 values, where 'X' denotes the vectorization factor, indicating which vector lanes are active for the current loop iteration.

i32 @llvm.arm.loop.end(i32, i32)

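Takes two operands:
- The number of elements that still need processing.
- The maximum number of elements processed in one iteration of the IR loop body.
Returns the number of elements remaining after this iteration (%elts.rem in the illustration that follows).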

The following gives an illustration of their intended usage:

entry:
  %0 = call i32 @llvm.arm.set.loop.elements(i32 %N, i32 4)
  %1 = icmp ne i32 %0, 0
  br i1 %1, label %vector.ph, label %for.loopexit

vector.ph:
  br label %vector.body

vector.body:
  %elts = phi i32 [ %N, %vector.ph ], [ %elts.rem, %vector.body ]
  %active = call <4 x i1> @llvm.arm.get.active.mask.4(i32 %elts)
  %load = tail call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %addr, i32 4, <4 x i1> %active, <4 x i32> undef)
  tail call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %load, <4 x i32>* %addr.1, i32 4, <4 x i1> %active)
  %elts.rem = call i32 @llvm.arm.loop.end(i32 %elts, i32 4)
  %cmp = icmp sgt i32 %elts.rem, 0
  br i1 %cmp, label %vector.body, label %for.loopexit

for.loopexit:
  ret void

As the example shows, control-flow is still ultimately performed
through the icmp and br pair. There's nothing connecting the
intrinsics to a given loop or any requirement that a set.loop.* call
needs to be paired with a loop.end call.
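
To make the intended semantics concrete, here is a minimal scalar model of the intrinsics, written as plain C++ rather than LLVM code. The function names are hypothetical stand-ins, and two points are assumptions rather than anything the RFC states: that set.loop.elements rounds up (ceiling division), and that loop.end is a plain subtraction that may go negative, which would be consistent with the sgt comparison in the illustration. Unlike the IR illustration, the sketch also advances the address on each iteration.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

// Lanes per vector iteration, matching the <4 x i32> illustration.
constexpr int32_t kVF = 4;

// Model of set.loop.elements: iterations needed to cover `elements` when at
// most `maxPerIter` are processed per iteration (ceiling division assumed).
int32_t setLoopElements(int32_t elements, int32_t maxPerIter) {
  assert(elements >= 0 && maxPerIter > 0);
  const int64_t total = static_cast<int64_t>(elements) + maxPerIter - 1;
  return static_cast<int32_t>(total / maxPerIter);
}

// Model of get.active.mask.4: lane i is active while i < remaining.
std::array<bool, kVF> getActiveMask(int32_t remaining) {
  std::array<bool, kVF> mask{};
  for (int32_t lane = 0; lane < kVF; ++lane)
    mask[lane] = lane < remaining;
  return mask;
}

// Model of loop.end: elements left after this iteration. The result may be
// negative on the last iteration (assumption, see lead-in).
int32_t loopEnd(int32_t remaining, int32_t maxPerIter) {
  return remaining - maxPerIter;
}

// Tail-predicated copy shaped like the IR illustration. Returns the number
// of loop iterations that were executed.
int32_t predicatedCopy(const std::vector<int32_t> &src,
                       std::vector<int32_t> &dst) {
  assert(src.size() == dst.size());
  const int32_t n = static_cast<int32_t>(src.size());
  if (setLoopElements(n, kVF) == 0) // entry guard: icmp ne %0, 0
    return 0;
  int32_t iterations = 0;
  int32_t base = 0;
  int32_t elts = n;
  do {
    const std::array<bool, kVF> active = getActiveMask(elts);
    for (int32_t lane = 0; lane < kVF; ++lane)
      if (active[lane]) // masked load + masked store
        dst[base + lane] = src[base + lane];
    base += kVF;
    ++iterations;
    elts = loopEnd(elts, kVF);
  } while (elts > 0); // icmp sgt %elts.rem, 0
  return iterations;
}
```

With ten elements the model runs three iterations, and the last one has only lanes 0 and 1 active, so no separate scalar epilogue is needed.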

Low-overhead loops in the Arm backend

Disclaimer: The prototype is barebones and reuses parts of NEON and
I'm currently targeting the Cortex-A72 which does not support this
feature! opt and llc build and the provided test case doesn't cause a
crash...

The low-overhead branch extension can be combined with MVE to
generate vectorized loops in which the epilogue is executed within
the predicated vector body. The proposal is for this to be supported
through a series of passes:

- An IR LoopPass to identify suitable loops and insert the intrinsics proposed above.
- DAGToDAG ISel, which maps the intrinsics, almost 1-1, to pseudo instructions.
- A final MachineFunctionPass to expand the pseudo instructions.

To help / enable the lowering of an i1 vector, the VPR register has
been added. This is a status register that contains the P0 predicate
and is also used to model the implicit predicates of tail-predicated
loops.

There are two main reasons why pseudo instructions are used instead
of generating MIs directly during ISel:

- They give us a chance to inspect the whole loop later and confirm that it's a good idea to generate such a loop. This is trivial for scalar loops, but not really applicable for tail-predicated loops.
- They allow us to separate the decrementing of the loop counter from the instruction that branches back, which should help us recover if LR gets spilt between these two pseudo ops.

For Armv8.1-M, the while.setup intrinsic is used to generate the wls
and wlstp instructions, while loop.end generates the le and letp
instructions. The active.mask can just be removed because the lane
predication is handled implicitly.

I'm not sure of the vectorizer's limitations in generating vector
instructions that operate across lanes, such as reductions, when
generating a predicated loop, but this needs to be considered.
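
One way to picture the reduction concern is with a plain C++ sketch, which is not taken from the patch: in a tail-predicated sum, the inactive lanes of the final iteration must contribute the identity value (0 for addition), otherwise the cross-lane reduction after the loop picks up garbage. The function name and the four-lane width are assumptions made for illustration only.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed vector width for the sketch.
constexpr std::size_t kLanes = 4;

// Sums `count` values with a tail-predicated loop. Each lane keeps its own
// partial sum; inactive lanes add 0 so they cannot affect the final result.
int64_t predicatedSum(const int32_t *data, std::size_t count) {
  assert(data != nullptr || count == 0);
  std::array<int64_t, kLanes> partial{};
  for (std::size_t base = 0; base < count; base += kLanes) {
    for (std::size_t lane = 0; lane < kLanes; ++lane) {
      const bool active = base + lane < count;
      // Only read memory for active lanes; inactive lanes use the identity.
      partial[lane] += active ? static_cast<int64_t>(data[base + lane]) : 0;
    }
  }
  // Cross-lane reduction, performed once after the loop.
  int64_t total = 0;
  for (int64_t value : partial)
    total += value;
  return total;
}
```

The same idea applies to other reductions with their own identity values (for example 1 for multiplication), which is the kind of constraint a predicated vectorizer would have to respect.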

Event Timeline

samparker created this revision. May 20 2019, 3:29 AM

Herald added subscribers: kristof.beyls, javed.absar, mgorny. May 20 2019, 3:29 AM

Hi Sam, many thanks for the detailed RFC and prototype!

Of course I need some more time to digest this, but just a first nitpick of something I noticed:

To help / enable the lowering of an i1 vector, the VPR register has been added. This is a status register that contains the P0 predicate and is also used to model the implicit predicates of tail-predicated loops.

Loop tail predication and VPT block predication use different mechanisms, architecturally. The former uses FPSCR.LTPSIZE, and the latter VPR, right? But I don't think it matters or changes anything for the rest of your story.

In D62132#1508322, @SjoerdMeijer wrote:

Loop tail predication and VPT block predication use different mechanisms, architecturally. The former uses FPSCR.LTPSIZE, and the latter VPR, right? But I don't think it matters or changes anything for the rest of your story.

Indeed, the register is only used for code generation here. I expect that we will need separate registers in the compiler for each in the final implementation, as both architectural registers are used at runtime if there's a VPT block within a tail-predicated loop.

Changed the loop.end icmp to use ne instead of sgt.

markus added a subscriber: markus. May 21 2019, 12:25 AM

This is interesting. Our (Ericsson's) out-of-tree target has hardware loops and we currently do a similar thing, i.e.:

- Use SCEV on IR to determine trip count and insert a hint intrinsic.
- Iselect the hint 1:1.
- Custom machine function pass picks it up and inserts a hardware loop instruction if sufficient conditions are satisfied.

So that said we would be happy to see generic support upstream and would likely convert to that once available.

Great! I'll make a start on a target-independent framework, combining this and the PPC implementation.

JonChesterfield added a subscriber: JonChesterfield. May 21 2019, 7:06 AM

psnobl added a subscriber: psnobl. May 21 2019, 7:18 AM

I've implemented this out of tree (for Graphcore), based loosely on the PPC implementation. IR pass based on SCEV inserts intrinsics, SDag patches them up a little, MIR pass picks appropriate instructions or falls back to a decrement and branch loop.

As the example shows, control-flow is still ultimately performed through the icmp and br pair.

This is interesting. We've also gone with a pair of intrinsics in IR - one in the loop preheader that takes an integer for trip count (backedge + 1), one in the body which returns an i1 that goes to brcond. Ideally I'd have liked to use an intrinsic to represent the control flow. An opaque intrinsic returning a boolean for brcond is approximately the same. An integer is threaded between the two in order to keep a GP register live until the back end where the lowering to ISA may need it, but that's probably unique to our arch.

Folding the icmp behaviour into the intrinsic instead stands a reasonable chance of making the induction variable dead (well, effectively hidden in the intrinsics) with subsequent IR simplification. Is the advantage to keeping it explicit in the interaction with other loop passes?

In SDag for our target, each IR intrinsic turns into two pseudo instructions. One representing the branch, one representing the arithmetic. It seems you've continued with two intrinsics, using a specific register and side effects to represent the combination. For example, the 'decrement arithmetic' pseudo turns into either a subtract one or into no code, depending on the instructions chosen. Keeping it separate from the terminator seems to help position it in a reasonable place in the BB for the decrement. I'm not sure which approach is better.

Would your implementation make sense without the masked load/store support? I'm wondering if that's a reasonable place to split the patch.

Thanks for the diff!

Hi Jon,

I used a (br (icmp (intrinsic))) combination just because we don't have native i1 support in the Arm backend; I've been lazy in order to get a quick prototype together. I want these intrinsics to return an i1 to be used directly by the brcond. We also want to keep a GP register live too (our loop counter lives in the link register), so it sounds very similar to yours.

I've focused on supporting masked load/stores because it is the more involved side of this transform so I wanted a proof-of-concept. I'm currently converting the PPC pass to a target independent pass and that appears to handle the 'normal' loops fine. I'll try to post that up tomorrow and if people are reasonably happy with the approach, I'll break it up again and leave the predicate support until last.

Thanks for taking a look.

Introduced a target-independent pass to insert intrinsics to generate hardware loops, which is based upon PPCCTRLoops. A hook has been added to TTI to decide what loops should be converted:

bool isHardwareLoopProfitable(Loop *L, ScalarEvolution &SE,
                              AssumptionCache &AC,
                              TargetLibraryInfo *LibInfo,
                              HardwareLoopInfo &HWLoopInfo);

HardwareLoopInfo is introduced to allow the backend to describe the properties of the loop:

struct HardwareLoopInfo {
      HardwareLoopInfo(Loop *L) : L(L) { }
      Loop *L                 = nullptr;
      BasicBlock *ExitBlock   = nullptr;
      BranchInst *ExitBranch  = nullptr;
      const SCEV *ExitCount   = nullptr;
      Instruction *Predicate  = nullptr; // Value controlling masked ops.
      IntegerType *CountType  = nullptr;
      bool PerformTest        = false;   // Can guard loop entry.
      bool IsNestingLegal     = false;   // Can HW loops be nested.
      bool InsertPHICounter   = false;   // Keep the loop counter in reg?
      unsigned NumElements    = 1;       // Max number of elements
                                         // processed in an iteration.
};

The pass can insert four different intrinsics for setting up a loop:

- int_set_loop_iterations: Takes an integer trip count.
- int_test_set_loop_iterations: Takes an integer trip count and tests whether the loop should be entered.
- int_set_loop_elements: Takes two integers, (1) the total elements to be processed by the loop and (2) the maximum number of elements processed in each iteration.
- int_test_set_loop_elements: Same as above but also tests whether the loop should be entered.
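
As a rough sketch of how the four setup intrinsics relate to the HardwareLoopInfo fields, the choice could be driven by PerformTest and NumElements. This selection rule is an assumption made for illustration, and the prototype pass may decide differently. LoopSetupInfo and selectSetupIntrinsic are hypothetical names, and the struct is a trimmed-down stand-in that avoids any LLVM types.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the two HardwareLoopInfo fields used here.
struct LoopSetupInfo {
  bool PerformTest = false; // Can the setup also guard loop entry?
  unsigned NumElements = 1; // Max number of elements per iteration.
};

// Returns the name of the setup intrinsic to insert. Assumed rule: more than
// one element per iteration selects the 'elements' form, and PerformTest
// selects the testing form.
std::string selectSetupIntrinsic(const LoopSetupInfo &Info) {
  assert(Info.NumElements >= 1 && "a loop processes at least one element");
  const bool UseElements = Info.NumElements > 1;
  if (Info.PerformTest)
    return UseElements ? "int_test_set_loop_elements"
                       : "int_test_set_loop_iterations";
  return UseElements ? "int_set_loop_elements" : "int_set_loop_iterations";
}
```

Under this rule a scalar do-while loop would get int_set_loop_iterations, while a guarded, four-wide tail-predicated loop would get int_test_set_loop_elements.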

PowerPC codegen tests are still passing and the hacks in the Arm backend allow my previous example to still work.

My plan now is:

- Introduce the target-independent pass with just the features (set_loop_iterations and loop_dec) needed to enable PowerPC and remove PPCCTRLoops.
- Add Arm support for the previous two intrinsics.
- Introduce the testing form of set_loop_iterations with Arm support.
- Introduce the 'element' versions of the intrinsics with Arm support.

The Arm support will be dependent on getting the initial architecture support upstream.

Herald added subscribers: jsji, jfb, kbarton, nemanjai. May 23 2019, 2:15 AM

jsji added a subscriber: shchenz. May 23 2019, 7:14 AM

Would it be an idea if we start splitting this up now? Because it looks like there is a lot of support for this, there's consensus on the direction (there are only minor implementation differences). We've got at least 3 patches here: the target independent hwloop pass, the changes to the PPC backend, the changes to the ARM backend, and perhaps a 4th patch is the intrinsics. That's a lot of code, and then we can start reviewing things separately.

dmitry added a subscriber: dmitry. May 27 2019, 11:09 AM

Created the initial patch to convert the PowerPC pass into something generic: https://reviews.llvm.org/D62604

Patch to enable Arm code generation for do-while loops: https://reviews.llvm.org/D63476

Introduce an intrinsic, and generic support, so that we can set the loop counter and also test that it is not zero: https://reviews.llvm.org/D63809

Closing:

The generic hardware loop support has been pulled out of PPC and into CodeGen, with new intrinsics added too.
Arm codegen support has been added for scalar low-overhead loops.
Work is ongoing in the vectorizer to control predication.
A pass has been added into the Arm backend to massage the IR in preparation for tail predication.

Herald added subscribers: jdoerfert, wuzish, MaskRay. Sep 13 2019, 5:14 AM

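Revision Contents

Path                                               Size
include/llvm/Analysis/TargetTransformInfo.h        32 lines
include/llvm/Analysis/TargetTransformInfoImpl.h    7 lines
include/llvm/CodeGen/BasicTTIImpl.h                7 lines
include/llvm/CodeGen/Passes.h                      2 lines
include/llvm/IR/Intrinsics.td                      19 lines
include/llvm/InitializePasses.h                    1 line
lib/Analysis/TargetTransformInfo.cpp               6 lines
lib/CodeGen/CMakeLists.txt                         1 line
lib/CodeGen/HardwareLoops.cpp                      483 lines
lib/Target/ARM/ARM.h                               1 line
lib/Target/ARM/ARMFinalizeHardwareLoops.cpp        256 lines
lib/Target/ARM/ (file name missing in archive)     47 lines
lib/Target/ARM/ (file name missing in archive)     2 lines
lib/Target/ARM/ (file name missing in archive)     32 lines
lib/Target/ARM/ (file name missing in archive)     6 lines
lib/Target/ARM/ (file name missing in archive)     83 lines
lib/Target/ARM/ (file name missing in archive)     5 lines
lib/Target/ARM/ (file name missing in archive)     5 lines
lib/Target/ARM/ARMTargetTransformInfo.h            9 lines
lib/Target/ARM/ARMTargetTransformInfo.cpp          64 lines
lib/Target/ARM/CMakeLists.txt                      1 line
lib/Target/PowerPC/ (file name missing in archive) 572 lines
lib/Target/PowerPC/ (file name missing in archive) 18 lines
lib/Target/PowerPC/ (file name missing in archive) 2 lines
lib/Target/PowerPC/ (file name missing in archive) 2 lines
lib/Target/PowerPC/ (file name missing in archive) 3 lines
lib/Target/PowerPC/PPCTargetTransformInfo.h        5 lines
lib/Target/PowerPC/PPCTargetTransformInfo.cpp      344 lines
test/CodeGen/PowerPC/ctrloop-intrin.ll             11 lines
test/CodeGen/PowerPC/ppc-passname.ll               12 lines
test/CodeGen/Thumb2/mve-tailpred.ll                78 lines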

Diff 200890

include/llvm/Analysis/TargetTransformInfo.h

[...]
 #include <functional>

 namespace llvm {

 namespace Intrinsic {
 enum ID : unsigned;
 }

+class AssumptionCache;
+class BranchInst;
 class Function;
 class GlobalValue;
 class IntrinsicInst;
 class LoadInst;
 class Loop;
 class SCEV;
 class ScalarEvolution;
 class StoreInst;
 class SwitchInst;
+class TargetLibraryInfo;
 class Type;
 class User;
 class Value;

 /// Information about a load/store intrinsic defined by the target.
 struct MemIntrinsicInfo {
   /// This is the pointer that the intrinsic is loading from or storing to.
   /// If this is non-null, then analysis/optimization passes can assume that
[...] public:
   };

   /// Get target-customized preferences for the generic loop unrolling
   /// transformation. The caller will initialize UP with the current
   /// target-independent defaults.
   void getUnrollingPreferences(Loop *L, ScalarEvolution &,
                                UnrollingPreferences &UP) const;

+  struct HardwareLoopInfo {
+    HardwareLoopInfo(Loop *L) : L(L) { }
+    Loop *L = nullptr;
+    BasicBlock *ExitBlock = nullptr;
+    BranchInst *ExitBranch = nullptr;
+    const SCEV *ExitCount = nullptr;
+    Instruction *Predicate = nullptr;
+    IntegerType *CountType = nullptr;
+    bool PerformTest = false;
+    bool IsNestingLegal = false;
+    bool InsertPHICounter = false;
+    unsigned NumElements = 1;
+  };
+
+  bool isHardwareLoopProfitable(Loop *L, ScalarEvolution &SE,
+                                AssumptionCache &AC,
+                                TargetLibraryInfo *LibInfo,
+                                HardwareLoopInfo &HWLoopInfo) const;
+
   /// @}

   /// \name Scalar Target Information
   /// @{

   /// Flags indicating the kind of support for population count.
   ///
   /// Compared to the SW implementation, HW support is supposed to
[...] public:
   getUserCost(const User *U, ArrayRef<const Value *> Operands) = 0;
   virtual bool hasBranchDivergence() = 0;
   virtual bool isSourceOfDivergence(const Value *V) = 0;
   virtual bool isAlwaysUniform(const Value *V) = 0;
   virtual unsigned getFlatAddressSpace() = 0;
   virtual bool isLoweredToCall(const Function *F) = 0;
   virtual void getUnrollingPreferences(Loop *L, ScalarEvolution &,
                                        UnrollingPreferences &UP) = 0;
+  virtual bool isHardwareLoopProfitable(Loop *L, ScalarEvolution &SE,
+                                        AssumptionCache &AC,
+                                        TargetLibraryInfo *LibInfo,
+                                        HardwareLoopInfo &HWLoopInfo) = 0;
   virtual bool isLegalAddImmediate(int64_t Imm) = 0;
   virtual bool isLegalICmpImmediate(int64_t Imm) = 0;
   virtual bool isLegalAddressingMode(Type *Ty, GlobalValue *BaseGV,
                                      int64_t BaseOffset, bool HasBaseReg,
                                      int64_t Scale,
                                      unsigned AddrSpace,
                                      Instruction *I) = 0;
   virtual bool isLSRCostLess(TargetTransformInfo::LSRCost &C1,
[...] public:

   bool isLoweredToCall(const Function *F) override {
     return Impl.isLoweredToCall(F);
   }
   void getUnrollingPreferences(Loop *L, ScalarEvolution &SE,
                                UnrollingPreferences &UP) override {
     return Impl.getUnrollingPreferences(L, SE, UP);
   }
+  bool isHardwareLoopProfitable(Loop *L, ScalarEvolution &SE,
+                                AssumptionCache &AC,
+                                TargetLibraryInfo *LibInfo,
+                                HardwareLoopInfo &HWLoopInfo) override {
+    return Impl.isHardwareLoopProfitable(L, SE, AC, LibInfo, HWLoopInfo);
+  }
   bool isLegalAddImmediate(int64_t Imm) override {
     return Impl.isLegalAddImmediate(Imm);
   }
   bool isLegalICmpImmediate(int64_t Imm) override {
     return Impl.isLegalICmpImmediate(Imm);
   }
   bool isLegalAddressingMode(Type *Ty, GlobalValue *BaseGV, int64_t BaseOffset,
                              bool HasBaseReg, int64_t Scale,
[...]

include/llvm/Analysis/TargetTransformInfoImpl.h

[...] if (Name == "pow" || Name == "powf" || Name == "powl" || Name == "exp2" ||
         Name == "floorf" || Name == "ceil" || Name == "round" ||
         Name == "ffs" || Name == "ffsl" || Name == "abs" || Name == "labs" ||
         Name == "llabs")
       return false;

     return true;
   }

+  bool isHardwareLoopProfitable(Loop *L, ScalarEvolution &SE,
+                                AssumptionCache &AC,
+                                TargetLibraryInfo *LibInfo,
+                                TTI::HardwareLoopInfo &HWLoopInfo) {
+    return false;
+  }
+
   void getUnrollingPreferences(Loop *, ScalarEvolution &,
                                TTI::UnrollingPreferences &) {}

   bool isLegalAddImmediate(int64_t Imm) { return false; }

   bool isLegalICmpImmediate(int64_t Imm) { return false; }

   bool isLegalAddressingMode(Type *Ty, GlobalValue *BaseGV, int64_t BaseOffset,
[...]

include/llvm/CodeGen/BasicTTIImpl.h

[...] void getUnrollingPreferences(Loop *L, ScalarEvolution &SE,
     UP.OptSizeThreshold = 0;
     UP.PartialOptSizeThreshold = 0;

     // Set number of instructions optimized when "back edge"
     // becomes "fall through" to default value of 2.
     UP.BEInsns = 2;
   }

+  bool isHardwareLoopProfitable(Loop *L, ScalarEvolution &SE,
+                                AssumptionCache &AC,
+                                TargetLibraryInfo *LibInfo,
+                                TTI::HardwareLoopInfo &HWLoopInfo) {
+    return BaseT::isHardwareLoopProfitable(L, SE, AC, LibInfo, HWLoopInfo);
+  }
+
   int getInstructionLatency(const Instruction *I) {
     if (isa<LoadInst>(I))
       return getST()->getSchedModel().DefaultLoadLatency;

     return BaseT::getInstructionLatency(I);
   }

   /// @}
[...]

include/llvm/CodeGen/Passes.h

[...] /// MachineDominanaceFrontier - This pass is a machine dominators analysis pass.
   FunctionPass *createBreakFalseDeps();

   // This pass expands indirectbr instructions.
   FunctionPass *createIndirectBrExpandPass();

   /// Creates CFI Instruction Inserter pass. \see CFIInstrInserter.cpp
   FunctionPass *createCFIInstrInserter();

+  FunctionPass *createHardwareLoops();
+
 } // End llvm namespace

 #endif

include/llvm/IR/Intrinsics.td

[...]
                                                     [IntrNoMem]>;
 def int_experimental_vector_reduce_fmax : Intrinsic<[llvm_anyfloat_ty],
                                                     [llvm_anyvector_ty],
                                                     [IntrNoMem]>;
 def int_experimental_vector_reduce_fmin : Intrinsic<[llvm_anyfloat_ty],
                                                     [llvm_anyvector_ty],
                                                     [IntrNoMem]>;

+def int_set_loop_iterations :
+  Intrinsic<[], [llvm_anyint_ty], [IntrNoDuplicate]>;
+
+def int_test_set_loop_iterations :
+  Intrinsic<[llvm_i1_ty], [llvm_anyint_ty], [IntrNoDuplicate]>;
+
+def int_set_loop_elements :
+  Intrinsic<[], [llvm_anyint_ty, llvm_anyint_ty], [IntrNoDuplicate]>;
+
+def int_test_set_loop_elements :
+  Intrinsic<[llvm_i1_ty], [llvm_anyint_ty, llvm_anyint_ty], [IntrNoDuplicate]>;
+
+def int_loop_dec :
+  Intrinsic<[llvm_anyint_ty],
+            [llvm_anyint_ty, llvm_anyint_ty], [IntrNoDuplicate]>;
+
+def int_get_active_mask_4 :
+  Intrinsic<[llvm_v4i1_ty], [llvm_anyint_ty], [IntrNoDuplicate]>;
+
 //===----- Intrinsics that are used to provide predicate information -----===//

 def int_ssa_copy : Intrinsic<[llvm_any_ty], [LLVMMatchType<0>],
                              [IntrNoMem, Returned<0>]>;
 //===----------------------------------------------------------------------===//
 // Target-specific intrinsics
 //===----------------------------------------------------------------------===//

[...]

include/llvm/InitializePasses.h

[...]
 void initializeGVNLegacyPassPass(PassRegistry&);
 void initializeGVNSinkLegacyPassPass(PassRegistry&);
 void initializeGlobalDCELegacyPassPass(PassRegistry&);
 void initializeGlobalMergePass(PassRegistry&);
 void initializeGlobalOptLegacyPassPass(PassRegistry&);
 void initializeGlobalSplitPass(PassRegistry&);
 void initializeGlobalsAAWrapperPassPass(PassRegistry&);
 void initializeGuardWideningLegacyPassPass(PassRegistry&);
+void initializeHardwareLoopsPass(PassRegistry&);
 void initializeHotColdSplittingLegacyPassPass(PassRegistry&);
 void initializeHWAddressSanitizerLegacyPassPass(PassRegistry &);
 void initializeIPCPPass(PassRegistry&);
 void initializeIPSCCPLegacyPassPass(PassRegistry&);
 void initializeIRCELegacyPassPass(PassRegistry&);
 void initializeIRTranslatorPass(PassRegistry&);
 void initializeIVUsersWrapperPassPass(PassRegistry&);
 void initializeIfConverterPass(PassRegistry&);
[...]

lib/Analysis/TargetTransformInfo.cpp

[...] bool TargetTransformInfo::isLoweredToCall(const Function *F) const {
   return TTIImpl->isLoweredToCall(F);
 }

 void TargetTransformInfo::getUnrollingPreferences(
     Loop *L, ScalarEvolution &SE, UnrollingPreferences &UP) const {
   return TTIImpl->getUnrollingPreferences(L, SE, UP);
 }

+bool TargetTransformInfo::isHardwareLoopProfitable(
+    Loop *L, ScalarEvolution &SE, AssumptionCache &AC,
+    TargetLibraryInfo *LibInfo, HardwareLoopInfo &HWLoopInfo) const {
+  return TTIImpl->isHardwareLoopProfitable(L, SE, AC, LibInfo, HWLoopInfo);
+}
+
 bool TargetTransformInfo::isLegalAddImmediate(int64_t Imm) const {
   return TTIImpl->isLegalAddImmediate(Imm);
 }

 bool TargetTransformInfo::isLegalICmpImmediate(int64_t Imm) const {
   return TTIImpl->isLegalICmpImmediate(Imm);
 }

[...]

lib/CodeGen/CMakeLists.txt

[...] add_llvm_library(LLVMCodeGen
   FaultMaps.cpp
   FEntryInserter.cpp
   FuncletLayout.cpp
   GCMetadata.cpp
   GCMetadataPrinter.cpp
   GCRootLowering.cpp
   GCStrategy.cpp
   GlobalMerge.cpp
+  HardwareLoops.cpp
   IfConversion.cpp
   ImplicitNullChecks.cpp
   IndirectBrExpandPass.cpp
   InlineSpiller.cpp
   InterferenceCache.cpp
   InterleavedAccessPass.cpp
   InterleavedLoadCombinePass.cpp
   IntrinsicLowering.cpp
[...]

lib/CodeGen/HardwareLoops.cpp

This file was added.

				#include "llvm/Pass.h"
				#include "llvm/PassRegistry.h"
				#include "llvm/PassSupport.h"
				#include "llvm/ADT/Statistic.h"
				#include "llvm/Analysis/AssumptionCache.h"
				#include "llvm/Analysis/CFG.h"
				#include "llvm/Analysis/LoopInfo.h"
				#include "llvm/Analysis/LoopIterator.h"
				#include "llvm/Analysis/ScalarEvolution.h"
				#include "llvm/Analysis/ScalarEvolutionExpander.h"
				#include "llvm/Analysis/TargetTransformInfo.h"
				#include "llvm/CodeGen/Passes.h"
				#include "llvm/CodeGen/TargetPassConfig.h"
				#include "llvm/IR/BasicBlock.h"
				#include "llvm/IR/DataLayout.h"
				#include "llvm/IR/Dominators.h"
				#include "llvm/IR/Constants.h"
				#include "llvm/IR/IRBuilder.h"
				#include "llvm/IR/Instructions.h"
				#include "llvm/IR/IntrinsicInst.h"
				#include "llvm/IR/Value.h"
				#include "llvm/Support/Debug.h"
				#include "llvm/Transforms/Scalar.h"
				#include "llvm/Transforms/Utils.h"
				#include "llvm/Transforms/Utils/BasicBlockUtils.h"
				#include "llvm/Transforms/Utils/Local.h"
				#include "llvm/Transforms/Utils/LoopUtils.h"

				using namespace llvm;

				#define DEBUG_TYPE "hardware-loops"

				#define HW_LOOPS_NAME "Hardware Loop Insertion"

				STATISTIC(NumHWLoops, "Number of loops converted to hardware loops");

				namespace {

				using TTI = TargetTransformInfo;

				class HardwareLoops : public FunctionPass {
				public:
				static char ID;

				HardwareLoops() : FunctionPass(ID) {
				initializeHardwareLoopsPass(*PassRegistry::getPassRegistry());
				}

				bool runOnFunction(Function &F) override;

				void getAnalysisUsage(AnalysisUsage &AU) const override {
				AU.addRequired<LoopInfoWrapperPass>();
				AU.addPreserved<LoopInfoWrapperPass>();
				AU.addRequired<DominatorTreeWrapperPass>();
				AU.addPreserved<DominatorTreeWrapperPass>();
				AU.addRequired<ScalarEvolutionWrapperPass>();
				AU.addRequired<AssumptionCacheTracker>();
				AU.addRequired<TargetTransformInfoWrapperPass>();
				}

				bool TryConvertLoop(Loop *L);
				bool TryConvertLoop(TTI::HardwareLoopInfo &HWLoopInfo);
				void ConvertLoop(TTI::HardwareLoopInfo &HWLoopInfo);

				private:
				ScalarEvolution *SE = nullptr;
				LoopInfo *LI = nullptr;
				const DataLayout *DL = nullptr;
				const TargetTransformInfo *TTI = nullptr;
				DominatorTree *DT = nullptr;
				bool PreserveLCSSA = false;
				AssumptionCache *AC = nullptr;
				TargetLibraryInfo *LibInfo = nullptr;
				Module *M = nullptr;
				};
				}

				char HardwareLoops::ID = 0;

				bool HardwareLoops::runOnFunction(Function &F) {
				if (skipFunction(F))
				return false;

				LLVM_DEBUG(dbgs() << "HWLoops: Running on " << F.getName() << "\n");
				auto *TPC = getAnalysisIfAvailable<TargetPassConfig>();
				if (!TPC)
				return false;

				LI = &getAnalysis<LoopInfoWrapperPass>().getLoopInfo();
				SE = &getAnalysis<ScalarEvolutionWrapperPass>().getSE();
				DT = &getAnalysis<DominatorTreeWrapperPass>().getDomTree();
				TTI = &getAnalysis<TargetTransformInfoWrapperPass>().getTTI(F);
				DL = &F.getParent()->getDataLayout();
				auto *TLIP = getAnalysisIfAvailable<TargetLibraryInfoWrapperPass>();
				LibInfo = TLIP ? &TLIP->getTLI() : nullptr;
				PreserveLCSSA = mustPreserveAnalysisID(LCSSAID);
				AC = &getAnalysis<AssumptionCacheTracker>().getAssumptionCache(F);
				M = F.getParent();

				bool MadeChange = false;

				for (LoopInfo::iterator I = LI->begin(), E = LI->end(); I != E; ++I) {
				Loop *L = *I;

				if (!L->getParentLoop())
				MadeChange |= TryConvertLoop(L);
				}

				return MadeChange;
				}

				bool HardwareLoops::TryConvertLoop(Loop *L) {
				bool MadeChange = false;

				// Process nested loops first.
				for (Loop::iterator I = L->begin(), E = L->end(); I != E; ++I) {
				MadeChange |= TryConvertLoop(*I);
				}

				if (MadeChange)
				return true;

				// Bail out if the loop has irreducible control flow.
				LoopBlocksRPO RPOT(L);
				RPOT.perform(LI);
				if (containsIrreducibleCFG<const BasicBlock *>(RPOT, *LI)) {
				LLVM_DEBUG(dbgs() << "HWLoops: Loop contains irreducible CFG.\n");
				return false;
				}

				TTI::HardwareLoopInfo HWLoopInfo(L);
				if (!TTI->isHardwareLoopProfitable(L, SE, AC, LibInfo, HWLoopInfo)) {
				LLVM_DEBUG(dbgs() << "HWLoops: Not profitable to convert loop.\n");
				return MadeChange;
				}

				MadeChange |= TryConvertLoop(HWLoopInfo);
				return MadeChange;
				}

				bool HardwareLoops::TryConvertLoop(TTI::HardwareLoopInfo &HWLoopInfo) {

				Loop *L = HWLoopInfo.L;
				//BasicBlock *CountedExitBlock = nullptr;
				//const SCEV *ExitCount = nullptr;
				//BranchInst *CountedExitBranch = nullptr;
				SmallVector<BasicBlock*, 4> ExitingBlocks;
				L->getExitingBlocks(ExitingBlocks);

				LLVM_DEBUG(dbgs() << "HWLoops: Try to convert profitable do loop: "
				<< *L << "\n");

				for (SmallVectorImpl<BasicBlock *>::iterator I = ExitingBlocks.begin(),
				IE = ExitingBlocks.end(); I != IE; ++I) {
				const SCEV *EC = SE->getExitCount(L, *I);
				LLVM_DEBUG(dbgs() << "HWLoops: Exit Count for " << *L << " from block "
				<< (*I)->getName() << ": " << *EC << "\n");
				if (isa<SCEVCouldNotCompute>(EC))
				continue;
				if (const SCEVConstant *ConstEC = dyn_cast<SCEVConstant>(EC)) {
				if (ConstEC->getValue()->isZero())
				continue;
				} else if (!SE->isLoopInvariant(EC, L))
				continue;

				if (SE->getTypeSizeInBits(EC->getType()) >
				HWLoopInfo.CountType->getBitWidth())
				continue;

				// If this exiting block is contained in a nested loop, it is not eligible
				// for insertion of the branch-and-decrement since the inner loop would
				// end up messing up the value in the CTR.
				if (!HWLoopInfo.IsNestingLegal && LI->getLoopFor(*I) != L)
				continue;

				// We now have a loop-invariant count of loop iterations (which is not the
				// constant zero) for which we know that this loop will not exit via this
				// exiting block.

				// We need to make sure that this block will run on every loop iteration.
				// For this to be true, we must dominate all blocks with backedges. Such
				// blocks are in-loop predecessors to the header block.
				bool NotAlways = false;
				for (pred_iterator PI = pred_begin(L->getHeader()),
				PIE = pred_end(L->getHeader()); PI != PIE; ++PI) {
				if (!L->contains(*PI))
				continue;

				if (!DT->dominates(*I, *PI)) {
				NotAlways = true;
				break;
				}
				}

				if (NotAlways)
				continue;

				// Make sure this block ends with a conditional branch.
				Instruction *TI = (*I)->getTerminator();
				if (!TI)
				continue;

				if (BranchInst *BI = dyn_cast<BranchInst>(TI)) {
				if (!BI->isConditional())
				continue;

				HWLoopInfo.ExitBranch = BI;
				} else
				continue;

				// Note that this block may not be the loop latch block, even if the loop
				// has a latch block.
				HWLoopInfo.ExitBlock = *I;
				HWLoopInfo.ExitCount = EC;
				break;
				}

				if (!HWLoopInfo.ExitBlock) {
				LLVM_DEBUG(dbgs() << "HWLoops: Unable to find CountExitBlock.\n");
				return false;
				}

				BasicBlock *Preheader = L->getLoopPreheader();

				// If we don't have a preheader, then insert one. If we already have a
				// preheader, then we can use it (except if the preheader contains a use of
				// the CTR register because some such uses might be reordered by the
				// selection DAG after the mtctr instruction).
				if (!Preheader)// || mightUseCTR(Preheader))
				Preheader = InsertPreheaderForLoop(L, DT, LI, nullptr, PreserveLCSSA);
				if (!Preheader)
				return false;

				LLVM_DEBUG(dbgs() << "Preheader for exit count: " << Preheader->getName()
				<< "\n");


				ConvertLoop(HWLoopInfo);
				LLVM_DEBUG(dbgs() << "Converted Loop: " << *L << "\n");
				++NumHWLoops;
				return true;
				}

				static const SCEV* CalcTotalElts(ConstantInt *Factor,
				const SCEV *TripCount,
				ScalarEvolution &SE) {
				if (Factor->equalsInt(1))
				return TripCount;

				const SCEV *FactorSCEV = SE.getSCEV(Factor);
				IntegerType *Int32Ty = Factor->getType();

				if (auto *Count = dyn_cast<SCEVConstant>(TripCount)) {
				const SCEV *Elts = SE.getMulExpr(TripCount, FactorSCEV);
				unsigned Rem = Count->getAPInt().urem(Factor->getZExtValue());
				if (Rem == 0)
				return Elts;
				else
				return SE.getAddExpr(Elts, SE.getSCEV(ConstantInt::get(Int32Ty, Rem)));
				}

				auto VisitAdd = [&](const SCEVAddExpr *S) -> const SCEVMulExpr* {
				LLVM_DEBUG(dbgs() << "ARM HWLOOPS: VisitAdd " << *S << "\n");
				if (auto *Const = dyn_cast<SCEVConstant>(S->getOperand(0))) {
				if (Const->getAPInt() != -Factor->getValue())
				return nullptr;
				} else
				return nullptr;
				return dyn_cast<SCEVMulExpr>(S->getOperand(1));
				};

				auto VisitMul = [&](const SCEVMulExpr *S) -> const SCEVUDivExpr* {
				LLVM_DEBUG(dbgs() << "ARM HWLOOPS: VisitMul " << *S << "\n");
				if (auto *Const = dyn_cast<SCEVConstant>(S->getOperand(0))) {
				if (Const->getValue() != Factor)
				return nullptr;
				} else
				return nullptr;
				return dyn_cast<SCEVUDivExpr>(S->getOperand(1));
				};

				auto VisitDiv = [&](const SCEVUDivExpr *S) -> const SCEV* {
				LLVM_DEBUG(dbgs() << "ARM HWLOOPS: VisitDiv " << *S << "\n");
				if (auto *Const = dyn_cast<SCEVConstant>(S->getRHS())) {
				if (Const->getValue() != Factor)
				return nullptr;
				} else
				return nullptr;

				if (auto *RoundUp = dyn_cast<SCEVAddExpr>(S->getLHS())) {
				if (auto *Const = dyn_cast<SCEVConstant>(RoundUp->getOperand(0))) {
				if (Const->getAPInt() != (Factor->getValue() - 1))
				return nullptr;
				} else
				return nullptr;

				LLVM_DEBUG(dbgs() << "ARM HWLOOPS: Elements: "
				<< *RoundUp->getOperand(1) << "\n");
				return RoundUp->getOperand(1);
				}
				return nullptr;
				};

				// (1 + ((-4 + (4 * ((3 + %N) /u 4))<nuw>) /u 4))<nuw><nsw>
				if (auto *TC = dyn_cast<SCEVAddExpr>(TripCount))
				if (auto *Div = dyn_cast<SCEVUDivExpr>(TC->getOperand(1)))
				if (auto *Add = dyn_cast<SCEVAddExpr>(Div->getLHS()))
				if (auto *Mul = VisitAdd(Add))
				if (auto *Div = VisitMul(Mul))
				if (auto *Elts = VisitDiv(Div))
				return Elts;

				return nullptr;
				}

				// Insert the count into the preheader and replace the condition used by the
				// selected branch.
				void HardwareLoops::ConvertLoop(TTI::HardwareLoopInfo &HWLoopInfo) {

				auto InitLoopCount = [this](TTI::HardwareLoopInfo &HWLoopInfo,
				BasicBlock *BB) {
				const SCEV *ExitCount = HWLoopInfo.ExitCount;

				Type *CountType = HWLoopInfo.CountType;
				SCEVExpander SCEVE(*SE, *DL, "loopcnt");
				if (!ExitCount->getType()->isPointerTy() &&
				ExitCount->getType() != CountType)
				ExitCount = SE->getZeroExtendExpr(ExitCount, CountType);

				ExitCount = SE->getAddExpr(ExitCount, SE->getOne(CountType));

				if (HWLoopInfo.Predicate) {
				ConstantInt *Factor = cast<ConstantInt>(
				ConstantInt::get(ExitCount->getType(), HWLoopInfo.NumElements));
				ExitCount = CalcTotalElts(Factor, ExitCount, *SE);
				}

				return SCEVE.expandCodeFor(ExitCount, CountType, BB->getTerminator());
				};

				auto InsertIterationSetup = [this](TTI::HardwareLoopInfo &HWLoopInfo,
				Value *LoopCountInit, BasicBlock *BB) {
				IRBuilder<> Builder(BB->getTerminator());
				Type *Ty = LoopCountInit->getType();

				if (HWLoopInfo.PerformTest) {
				Function *LoopIter =
				Intrinsic::getDeclaration(M, Intrinsic::test_set_loop_iterations,
				{ Ty, Ty });
				Value *Call = Builder.CreateCall(LoopIter, LoopCountInit);
				LLVM_DEBUG(dbgs() << "HWLoops: Inserted loop setup: " << *Call << "\n");

				auto *LoopGuard = dyn_cast<BranchInst>(BB->getTerminator());
				assert((LoopGuard && LoopGuard->isConditional()) &&
				"Expected conditional branch for while loop");
				//Value *Cmp = Builder.CreateICmpNE(Call, ConstantInt::get(Ty, 0));
				LoopGuard->setCondition(Call);

				if (LoopGuard->getSuccessor(0) != HWLoopInfo.L->getLoopPreheader())
				LoopGuard->swapSuccessors();
				} else {
				Function *LoopIter =
				Intrinsic::getDeclaration(M, Intrinsic::set_loop_iterations, Ty);
				Builder.CreateCall(LoopIter, LoopCountInit);
				}
				};

				auto InsertElementSetup = [this](TTI::HardwareLoopInfo &HWLoopInfo,
				Value *NumElts, BasicBlock *BB) {
				Type *Ty = HWLoopInfo.CountType;
				IRBuilder<> Builder(BB->getTerminator());
				Value *Ops[] = { NumElts, ConstantInt::get(Ty, HWLoopInfo.NumElements) };


				if (HWLoopInfo.PerformTest) {
				Function *Setup =
				Intrinsic::getDeclaration(M, Intrinsic::test_set_loop_elements,
				{ Ty, Ty });
				Instruction *Call = Builder.CreateCall(Setup, Ops);
				LLVM_DEBUG(dbgs() << "HWLoops: Insert loop elements: " << *Call << "\n");

				auto *LoopGuard = dyn_cast<BranchInst>(BB->getTerminator());
				assert((LoopGuard && LoopGuard->isConditional()) &&
				"Expected conditional branch for while loop");
				//Value *Cmp = Builder.CreateICmpNE(Call, ConstantInt::get(Ty, 0));
				LoopGuard->setCondition(Call);

				if (LoopGuard->getSuccessor(0) != HWLoopInfo.L->getLoopPreheader())
				LoopGuard->swapSuccessors();
				} else {
				Function *Setup =
				Intrinsic::getDeclaration(M, Intrinsic::set_loop_elements,
				{ Ty, Ty });
				Builder.CreateCall(Setup, Ops);
				}
				};

				auto InsertCounterPHI = [](TTI::HardwareLoopInfo &HWLoopInfo,
				Value *NumElts, Value *EltsRem) {
				BasicBlock *Preheader = HWLoopInfo.L->getLoopPreheader();
				BasicBlock *Header = HWLoopInfo.L->getHeader();
				BasicBlock *Latch = HWLoopInfo.ExitBranch->getParent();
				IRBuilder<> Builder(Header->getFirstNonPHI());
				PHINode *Index = Builder.CreatePHI(NumElts->getType(), 2);
				Index->addIncoming(NumElts, Preheader);
				Index->addIncoming(EltsRem, Latch);
				LLVM_DEBUG(dbgs() << "HWLoops: Index PHI: " << *Index << "\n");
				return Index;
				};

				auto InsertDec = [this](TTI::HardwareLoopInfo &HWLoopInfo, Value *NumElts) {
				BranchInst *ExitBranch = HWLoopInfo.ExitBranch;
				IRBuilder<> CondBuilder(ExitBranch);
				Value *Factor = ConstantInt::get(NumElts->getType(),
				HWLoopInfo.NumElements);
				Function *DecFunc =
				Intrinsic::getDeclaration(M, Intrinsic::loop_dec,
				{ NumElts->getType(), NumElts->getType(),
				Factor->getType()});
				Value *Ops[] = { NumElts, Factor };
				Value *Call = CondBuilder.CreateCall(DecFunc, Ops);
				Value *NewCond =
				CondBuilder.CreateICmpNE(Call,
				ConstantInt::get(NumElts->getType(), 0));
				Value *OldCond = ExitBranch->getCondition();
				ExitBranch->setCondition(NewCond);

				// The false branch must exit the loop.
				if (!HWLoopInfo.L->contains(ExitBranch->getSuccessor(0)))
				ExitBranch->swapSuccessors();

				// The old condition may be dead now, and may have even created a dead PHI
				// (the original induction variable).
				RecursivelyDeleteTriviallyDeadInstructions(OldCond);

				LLVM_DEBUG(dbgs() << "HWLoops: Inserted loop dec: " << *Call << "\n");
				return cast<Instruction>(Call);
				};

				auto InsertActiveMask = [this](TTI::HardwareLoopInfo &HWLoopInfo,
				Value *Elts) {
				IRBuilder<> Builder(HWLoopInfo.Predicate);
				Function *F =
				Intrinsic::getDeclaration(M, Intrinsic::get_active_mask_4, Elts->getType());
				Value *Ops[] = { Elts };
				Instruction *ActiveMask = Builder.CreateCall(F, Ops);
				LLVM_DEBUG(dbgs() << "HWLoops: Active Lane Mask: " << *ActiveMask << "\n");
				HWLoopInfo.Predicate->replaceAllUsesWith(ActiveMask);
				};

				BasicBlock *BeginBB = HWLoopInfo.PerformTest ?
				HWLoopInfo.L->getLoopPreheader()->getUniquePredecessor() :
				HWLoopInfo.L->getLoopPreheader();

				Value *LoopCountInit = InitLoopCount(HWLoopInfo, BeginBB);
				Value *EltsRem = LoopCountInit;

				if (HWLoopInfo.Predicate) {
				InsertElementSetup(HWLoopInfo, LoopCountInit, BeginBB);
				} else
				InsertIterationSetup(HWLoopInfo, LoopCountInit, BeginBB);

				Instruction *LoopDec = InsertDec(HWLoopInfo, EltsRem);
				if (HWLoopInfo.InsertPHICounter) {
				EltsRem = InsertCounterPHI(HWLoopInfo, LoopCountInit, LoopDec);
				LoopDec->setOperand(0, EltsRem);
				}
				if (HWLoopInfo.Predicate)
				InsertActiveMask(HWLoopInfo, EltsRem);

				// Run through the basic blocks of the loop and see if any of them have dead
				// PHIs that can be removed.
				for (auto I : HWLoopInfo.L->blocks())
				DeleteDeadPHIs(I);
				}

				INITIALIZE_PASS_BEGIN(HardwareLoops, DEBUG_TYPE, HW_LOOPS_NAME, false, false)
				INITIALIZE_PASS_DEPENDENCY(DominatorTreeWrapperPass)
				INITIALIZE_PASS_DEPENDENCY(LoopInfoWrapperPass)
				INITIALIZE_PASS_DEPENDENCY(ScalarEvolutionWrapperPass)
				INITIALIZE_PASS_END(HardwareLoops, DEBUG_TYPE, HW_LOOPS_NAME, false, false)

				FunctionPass *llvm::createHardwareLoops() { return new HardwareLoops(); }
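
The comment in CalcTotalElts above quotes the trip-count shape that the VisitAdd, VisitMul and VisitDiv lambdas walk: (1 + ((-4 + (4 * ((3 + %N) /u 4))) /u 4)). As a rough illustration only, the standalone sketch below evaluates that shape with plain unsigned arithmetic and compares it with a ceiling division, which is why the walker can hand back %N as the element count. The function names tripCountFromElements and ceilDiv are hypothetical and are not part of the patch; the sketch also assumes N >= 1 and that N + VF - 1 does not overflow 32 bits.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: evaluates the trip-count shape quoted in the
// CalcTotalElts comment, generalised from 4 to an arbitrary factor VF:
//   1 + ((-VF + (VF * ((VF - 1 + N) /u VF))) /u VF)
// All operations are unsigned 32-bit, mirroring the /u in the SCEV printout.
// Assumption: N >= 1 and N + VF - 1 does not wrap.
uint32_t tripCountFromElements(uint32_t N, uint32_t VF) {
  uint32_t RoundedUp = (VF - 1 + N) / VF;      // innermost udiv (round up)
  uint32_t Scaled = VF * RoundedUp;            // multiply by the factor
  uint32_t BackedgeCount = (Scaled - VF) / VF; // add -VF, then the outer udiv
  return 1 + BackedgeCount;                    // trip count = backedges + 1
}

// Hypothetical helper: plain ceiling division, used only for comparison.
uint32_t ceilDiv(uint32_t N, uint32_t VF) { return (N + VF - 1) / VF; }
```

For example, tripCountFromElements(10, 4) and ceilDiv(10, 4) both give 3: ten elements at four per iteration need three iterations, the last of which is only partially active. For N == 0 the subtraction wraps, so the two functions disagree; the sketch does not model how the patch guards that case.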

lib/Target/ARM/ARM.h

	class MachineBasicBlock;			class MachineBasicBlock;
	class MachineFunction;			class MachineFunction;
	class MachineInstr;			class MachineInstr;
	class MCInst;			class MCInst;
	class PassRegistry;			class PassRegistry;


	Pass *createARMParallelDSPPass();			Pass *createARMParallelDSPPass();
				FunctionPass *createARMFinaliseHardwareLoopsPass();
	FunctionPass *createARMISelDag(ARMBaseTargetMachine &TM,			FunctionPass *createARMISelDag(ARMBaseTargetMachine &TM,
	CodeGenOpt::Level OptLevel);			CodeGenOpt::Level OptLevel);
	FunctionPass *createA15SDOptimizerPass();			FunctionPass *createA15SDOptimizerPass();
	FunctionPass *createARMLoadStoreOptimizationPass(bool PreAlloc = false);			FunctionPass *createARMLoadStoreOptimizationPass(bool PreAlloc = false);
	FunctionPass *createARMExpandPseudoPass();			FunctionPass *createARMExpandPseudoPass();
	FunctionPass *createARMCodeGenPreparePass();			FunctionPass *createARMCodeGenPreparePass();
	FunctionPass *createARMConstantIslandPass();			FunctionPass *createARMConstantIslandPass();
	FunctionPass *createMLxExpansionPass();			FunctionPass *createMLxExpansionPass();

lib/Target/ARM/ARMFinalizeHardwareLoops.cpp

This file was added.

				//===----------------------------------------------------------------------===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//

				#include "ARM.h"
				#include "ARMBaseInstrInfo.h"
				#include "ARMBaseRegisterInfo.h"
				#include "llvm/CodeGen/MachineFunctionPass.h"
				#include "llvm/CodeGen/MachineLoopInfo.h"
				#include "llvm/CodeGen/MachineRegisterInfo.h"

				using namespace llvm;

				#define DEBUG_TYPE "arm-finalise-hardware-loops"
				#define ARM_FINALISE_HW_LOOPS_NAME "ARM hardware loop finalisation pass"

				namespace {

				class ARMFinaliseHWLoops : public MachineFunctionPass {
				const ARMBaseInstrInfo *TII = nullptr;

				public:
				static char ID;

				ARMFinaliseHWLoops() : MachineFunctionPass(ID) { }

				void getAnalysisUsage(AnalysisUsage &AU) const override {
				AU.setPreservesCFG();
				AU.addRequired<MachineLoopInfo>();
				MachineFunctionPass::getAnalysisUsage(AU);
				}

				bool runOnMachineFunction(MachineFunction &MF) override;

				bool ProcessLoop(MachineLoop *ML);

				void Expand(MachineInstr *Start, MachineInstr *Dec, MachineInstr *End,
				MachineInstr *ActiveMask,
				SmallVectorImpl<MachineInstr*> &Predicated);

				MachineFunctionProperties getRequiredProperties() const override {
				return MachineFunctionProperties().set(
				MachineFunctionProperties::Property::NoVRegs);
				}

				StringRef getPassName() const override {
				return ARM_FINALISE_HW_LOOPS_NAME;
				}
				};
				}

				char ARMFinaliseHWLoops::ID = 0;

				bool ARMFinaliseHWLoops::runOnMachineFunction(MachineFunction &MF) {
				auto &MLI = getAnalysis<MachineLoopInfo>();
				TII =
				static_cast<const ARMBaseInstrInfo*>(MF.getSubtarget().getInstrInfo());
				LLVM_DEBUG(dbgs() << " ------- ARM HWLOOPS on " << MF.getName() << "\n");

				bool Changed = false;
				for (auto ML : MLI) {
				if (!ML->getExitingBlock() || !ML->getHeader() || !ML->getLoopLatch())
				continue;
				Changed |= ProcessLoop(ML);
				}
				return Changed;
				}

				bool ARMFinaliseHWLoops::ProcessLoop(MachineLoop *ML) {

				LLVM_DEBUG(dbgs() << "ARM HWLOOPS: Processing " << *ML);
				auto SearchForStart = [](MachineBasicBlock *MBB) -> MachineInstr* {
				for (auto &MI : *MBB) {
				if (MI.getOpcode() == ARM::t2LoopStart)
				return &MI;
				}
				return nullptr;
				};

				MachineInstr *Start = nullptr;

				if (auto *Preheader = ML->getLoopPreheader()) {
				Start = SearchForStart(Preheader);
				if (!Start) {
				if (Preheader->pred_size() == 1) {
				MachineBasicBlock *PrePreheader = *Preheader->pred_begin();
				Start = SearchForStart(PrePreheader);
				}
				}
				}

				if (!Start)
				return false;
				LLVM_DEBUG(dbgs() << "ARM HWLOOPS: Found Loop Start: " << *Start);

				auto IsLoopDec = [](MachineInstr &MI) {
				return MI.getOpcode() == ARM::t2LoopDec;
				};

				auto IsLoopEnd = [](MachineInstr &MI) {
				return MI.getOpcode() == ARM::t2LoopEnd;
				};

				auto IsActiveMask = [](MachineInstr &MI) {
				return MI.getOpcode() == ARM::t2ActiveMask;
				};

				auto IsPredicated = [](MachineInstr &MI) {
				switch (MI.getOpcode()) {
				default:
				break;
				case ARM::VMSTR32:
				case ARM::VMLDR32:
				return true;
				}
				return false;
				};

				MachineInstr *Dec = nullptr;
				MachineInstr *End = nullptr;
				MachineInstr *ActiveMask = nullptr;
				bool FoundPredicated = false;
				bool IsProfitable = true;
				SmallVector<MachineInstr*, 4> Predicated;

				for (auto *MBB : ML->getBlocks()) {
				for (auto &MI : *MBB) {
				// TODO: For scalar loops, check for any instructions that mean a
				// low-overhead loop wouldn't be profitable. Should we bail if LR has
				// been spilt? We'd still need a register to control the loop count but
				// the loop index may increase whereas LE(TP) decrements it...
				//
				// Not inserting a low-overhead loop for a vector loop is not really an
				// option here as we'd either:
				// - Need to reconstruct a vector loop and a scalar epilogue.
				// - Try to use VIDUP and create a VPT block to predicate the lanes,
				// which would require using a Q register, all of which may be already
				// allocated, for the VIDUP result. It looks like VIDUP wouldn't even be
				// helpful for 16xi8 vectors because the instruction can only increment
				// by a maximum of 8.

				if (IsLoopDec(MI))
				Dec = &MI;
				else if (IsLoopEnd(MI))
				End = &MI;
				else if (IsActiveMask(MI))
				ActiveMask = &MI;
				else if (IsPredicated(MI)) {
				FoundPredicated = true;
				Predicated.push_back(&MI);
				}
				}
				}

				// Check that we've found the necessary components
				if (!Dec || !End || (FoundPredicated && !ActiveMask))
				return false;

				if (!IsProfitable)
				return false;

				LLVM_DEBUG(dbgs() << "ARM HWLOOPS: Found Loop Dec: " << *Dec);
				LLVM_DEBUG(dbgs() << "ARM HWLOOPS: Found Loop End: " << *End);

				// TODO: Verify that the cmp and br from the WLS either branch to the header
				// or the exit block.
				// TODO: Verify that the cmp and br from the LE either branch to the header
				// or the exit block.
				// TODO: Verify that all predicated instructions are using ActiveMask.

				Expand(Start, Dec, End, ActiveMask, Predicated);
				return true;
				}

				void ARMFinaliseHWLoops::Expand(MachineInstr *Start, MachineInstr *Dec,
				MachineInstr *End, MachineInstr *ActiveMask,
				SmallVectorImpl<MachineInstr*> &Predicated) {
				auto ExpandLoopStart = [this](MachineInstr *Start) {
				MachineBasicBlock &MBB = *Start->getParent();
				MachineInstrBuilder MIB = BuildMI(MBB, Start, Start->getDebugLoc(),
				TII->get(ARM::t2WLSTP));
				MIB.addDef(ARM::LR);
				unsigned OpIdx = 0;
				MIB.add(Start->getOperand(OpIdx++));
				MIB.add(Start->getOperand(OpIdx++));
				MIB.add(Start->getOperand(OpIdx++));
				MIB.add(predOps(ARMCC::AL));
				LLVM_DEBUG(dbgs() << "ARM HWLOOPS: Inserted WLSTP: " << *MIB << "\n");
				Start->eraseFromParent();
				};

				auto ExpandLoad = [this](MachineInstr *MI) {
				MachineBasicBlock &MBB = *MI->getParent();
				MachineInstrBuilder MIB = BuildMI(MBB, MI, MI->getDebugLoc(),
				TII->get(ARM::t2VLDRW));
				unsigned OpIdx = 0;
				MIB.add(MI->getOperand(OpIdx++));
				MIB.add(MI->getOperand(OpIdx++));
				MIB.add(predOps(ARMCC::AL));
				MI->eraseFromParent();
				};

				auto ExpandStore = [this](MachineInstr *MI) {
				MachineBasicBlock &MBB = *MI->getParent();
				MachineInstrBuilder MIB = BuildMI(MBB, MI, MI->getDebugLoc(),
				TII->get(ARM::t2VSTRW));
				unsigned OpIdx = 0;
				MIB.add(MI->getOperand(OpIdx++));
				MIB.add(MI->getOperand(OpIdx++));
				MIB.add(predOps(ARMCC::AL));
				MI->eraseFromParent();
				};

				auto RemoveActiveMask = [](MachineInstr *MI) {
				MI->eraseFromParent();
				};

				// Combine the LoopDec and LoopEnd instructions into LE(TP).
				auto ExpandLoopEnd = [this](MachineInstr *Dec, MachineInstr *End) {
				// TODO: Check and handle the cases where LR is spilt between Dec and End.
				MachineBasicBlock &MBB = *End->getParent();
				MachineInstrBuilder MIB = BuildMI(MBB, End, End->getDebugLoc(),
				TII->get(ARM::t2LETP));
				MIB.addDef(ARM::LR);
				unsigned OpIdx = 0;
				MIB.add(End->getOperand(OpIdx++));
				MIB.add(End->getOperand(OpIdx++));
				MIB.add(predOps(ARMCC::AL));
				LLVM_DEBUG(dbgs() << "ARM HWLOOPS: Inserted LETP: " << *MIB << "\n");
				End->eraseFromParent();
				Dec->eraseFromParent();
				};

				ExpandLoopStart(Start);
				ExpandLoopEnd(Dec, End);

				if (ActiveMask) {
				for (auto *MI : Predicated) {
				if (MI->mayLoad())
				ExpandLoad(MI);
				else if (MI->mayStore())
				ExpandStore(MI);
				else
				llvm_unreachable("unhandled predicated instruction");
				}
				RemoveActiveMask(ActiveMask);
				}
				}

				FunctionPass *llvm::createARMFinaliseHardwareLoopsPass() {
				return new ARMFinaliseHWLoops();
				}
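
ProcessLoop above scans every instruction in the loop, records the LoopDec, LoopEnd and ActiveMask pseudos together with any predicated loads or stores, and gives up unless the required pieces are all present. The standalone sketch below models only that decision, as an illustration rather than the patch's code. The Op enum, LoopParts, scanLoop and canFinalise are hypothetical stand-ins for the ARM opcodes and MachineInstr queries, and the search for the loop start in the preheader is not modelled.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-ins for the pseudo opcodes inspected by ProcessLoop.
enum class Op { LoopDec, LoopEnd, ActiveMask, PredicatedLoad, PredicatedStore, Other };

// What the scan remembers about one loop body.
struct LoopParts {
  bool HasDec = false;
  bool HasEnd = false;
  bool HasActiveMask = false;
  int NumPredicated = 0; // predicated loads and stores seen
};

// Walk the loop body once and record which components were found.
LoopParts scanLoop(const std::vector<Op> &Body) {
  LoopParts P;
  for (Op O : Body) {
    switch (O) {
    case Op::LoopDec:         P.HasDec = true; break;
    case Op::LoopEnd:         P.HasEnd = true; break;
    case Op::ActiveMask:      P.HasActiveMask = true; break;
    case Op::PredicatedLoad:
    case Op::PredicatedStore: ++P.NumPredicated; break;
    case Op::Other:           break;
    }
  }
  return P;
}

// Mirrors the check `!Dec || !End || (FoundPredicated && !ActiveMask)`:
// a decrement and an end are always needed, and predicated memory
// operations additionally need an active mask.
bool canFinalise(const LoopParts &P) {
  if (!P.HasDec || !P.HasEnd)
    return false;
  if (P.NumPredicated > 0 && !P.HasActiveMask)
    return false;
  return true;
}
```

Under these assumptions, a body containing LoopDec and LoopEnd but no predicated instructions passes the check, while a body with a predicated load and no ActiveMask is rejected.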

lib/Target/ARM/ARMISelDAGToDAG.cpp

	} else if (Subtarget->isThumb()) {			} else if (Subtarget->isThumb()) {
	if (tryT1IndexedLoad(N))			if (tryT1IndexedLoad(N))
	return;			return;
	} else if (tryARMIndexedLoad(N))			} else if (tryARMIndexedLoad(N))
	return;			return;
	// Other cases are autogenerated.			// Other cases are autogenerated.
	break;			break;
	}			}
				case ARMISD::WhileLoopStart: {
				SDValue Size = CurDAG->getTargetConstant(
				cast<ConstantSDNode>(N->getOperand(1))->getZExtValue(), dl, MVT::i32);
				SDValue Ops[] = { Size,
				N->getOperand(2),
				N->getOperand(3),
				N->getOperand(0) };
				SDNode *LoopStart =
				CurDAG->getMachineNode(ARM::t2LoopStart, dl, MVT::Other, Ops);
				ReplaceUses(N, LoopStart);
				CurDAG->RemoveDeadNode(N);
				return;
				}
	case ARMISD::BRCOND: {			case ARMISD::BRCOND: {
	// Pattern: (ARMbrcond:void (bb:Other):$dst, (imm:i32):$cc)			// Pattern: (ARMbrcond:void (bb:Other):$dst, (imm:i32):$cc)
	// Emits: (Bcc:void (bb:Other):$dst, (imm:i32):$cc)			// Emits: (Bcc:void (bb:Other):$dst, (imm:i32):$cc)
	// Pattern complexity = 6 cost = 1 size = 0			// Pattern complexity = 6 cost = 1 size = 0

	// Pattern: (ARMbrcond:void (bb:Other):$dst, (imm:i32):$cc)			// Pattern: (ARMbrcond:void (bb:Other):$dst, (imm:i32):$cc)
	// Emits: (tBcc:void (bb:Other):$dst, (imm:i32):$cc)			// Emits: (tBcc:void (bb:Other):$dst, (imm:i32):$cc)
	// Pattern complexity = 6 cost = 1 size = 0			// Pattern complexity = 6 cost = 1 size = 0
	Show All 10 Lines
	SDValue N3 = N->getOperand(3);			SDValue N3 = N->getOperand(3);
	SDValue InFlag = N->getOperand(4);			SDValue InFlag = N->getOperand(4);
	assert(N1.getOpcode() == ISD::BasicBlock);			assert(N1.getOpcode() == ISD::BasicBlock);
	assert(N2.getOpcode() == ISD::Constant);			assert(N2.getOpcode() == ISD::Constant);
	assert(N3.getOpcode() == ISD::Register);			assert(N3.getOpcode() == ISD::Register);

	unsigned CC = (unsigned) cast<ConstantSDNode>(N2)->getZExtValue();			unsigned CC = (unsigned) cast<ConstantSDNode>(N2)->getZExtValue();

	if (InFlag.getOpcode() == ARMISD::CMPZ) {			// Handle loops.
				if (InFlag.getOperand(0).getOpcode() == ISD::INTRINSIC_W_CHAIN) {
				if (InFlag.getOpcode() == ARMISD::CMPZ) {
				// Handle loops.
				SDValue Int = InFlag.getOperand(0);
				LLVM_DEBUG(dbgs() << "Int: "; Int.dump());
				uint64_t ID = cast<ConstantSDNode>(Int->getOperand(1))->getZExtValue();

				if (ID == Intrinsic::loop_dec) {
				SDValue Elements = Int.getOperand(2);
				SDValue Size = CurDAG->getTargetConstant(
				cast<ConstantSDNode>(Int.getOperand(3))->getZExtValue(), dl,
				MVT::i32);

				SDValue Args[] = { Elements, Size, Int.getOperand(0) };
				SDNode *LoopDec =
				CurDAG->getMachineNode(ARM::t2LoopDec, dl,
				CurDAG->getVTList(MVT::i32, MVT::Other),
				Args);
				ReplaceUses(Int.getNode(), LoopDec);

				SDValue EndArgs[] = { SDValue(LoopDec, 0), N1, Chain };
				SDNode *LoopEnd =
				CurDAG->getMachineNode(ARM::t2LoopEnd, dl, MVT::Other, EndArgs);

				ReplaceUses(N, LoopEnd);
				CurDAG->RemoveDeadNode(N);
				CurDAG->RemoveDeadNode(InFlag.getNode());
				CurDAG->RemoveDeadNode(Int.getNode());
				return;
				}
				}

	bool SwitchEQNEToPLMI;			bool SwitchEQNEToPLMI;
	SelectCMPZ(InFlag.getNode(), SwitchEQNEToPLMI);			SelectCMPZ(InFlag.getNode(), SwitchEQNEToPLMI);
	InFlag = N->getOperand(4);			InFlag = N->getOperand(4);

	if (SwitchEQNEToPLMI) {			if (SwitchEQNEToPLMI) {
	switch ((ARMCC::CondCodes)CC) {			switch ((ARMCC::CondCodes)CC) {
	default: llvm_unreachable("CMPZ must be either NE or EQ!");			default: llvm_unreachable("CMPZ must be either NE or EQ!");
	case ARMCC::NE:			case ARMCC::NE:

lib/Target/ARM/ARMISelLowering.h

enum NodeType : unsigned {

// Vector bitwise select		// Vector bitwise select
VBSL,		VBSL,

// Pseudo-instruction representing a memory copy using ldm/stm		// Pseudo-instruction representing a memory copy using ldm/stm
// instructions.		// instructions.
MEMCPY,		MEMCPY,

		WhileLoopStart,

// Vector load N-element structure to all lanes:		// Vector load N-element structure to all lanes:
VLD1DUP = ISD::FIRST_TARGET_MEMORY_OPCODE,		VLD1DUP = ISD::FIRST_TARGET_MEMORY_OPCODE,
VLD2DUP,		VLD2DUP,
VLD3DUP,		VLD3DUP,
VLD4DUP,		VLD4DUP,

// NEON loads with post-increment base updates:		// NEON loads with post-increment base updates:
VLD1_UPD,		VLD1_UPD,

lib/Target/ARM/ARMISelLowering.cpp

	setOperationAction(ISD::BITCAST, MVT::i16, Custom);			setOperationAction(ISD::BITCAST, MVT::i16, Custom);
	setOperationAction(ISD::BITCAST, MVT::i32, Custom);			setOperationAction(ISD::BITCAST, MVT::i32, Custom);
	setOperationAction(ISD::BITCAST, MVT::f16, Custom);			setOperationAction(ISD::BITCAST, MVT::f16, Custom);

	setOperationAction(ISD::FMINNUM, MVT::f16, Legal);			setOperationAction(ISD::FMINNUM, MVT::f16, Legal);
	setOperationAction(ISD::FMAXNUM, MVT::f16, Legal);			setOperationAction(ISD::FMAXNUM, MVT::f16, Legal);
	}			}

				const MVT pTypes[] = { MVT::v16i1, MVT::v8i1, MVT::v4i1 };
				for (auto VT : pTypes)
				addRegisterClass(VT, &ARM::VCCRRegClass);

	for (MVT VT : MVT::vector_valuetypes()) {			for (MVT VT : MVT::vector_valuetypes()) {
	for (MVT InnerVT : MVT::vector_valuetypes()) {			for (MVT InnerVT : MVT::vector_valuetypes()) {
	setTruncStoreAction(VT, InnerVT, Expand);			setTruncStoreAction(VT, InnerVT, Expand);
	setLoadExtAction(ISD::SEXTLOAD, VT, InnerVT, Expand);			setLoadExtAction(ISD::SEXTLOAD, VT, InnerVT, Expand);
	setLoadExtAction(ISD::ZEXTLOAD, VT, InnerVT, Expand);			setLoadExtAction(ISD::ZEXTLOAD, VT, InnerVT, Expand);
	setLoadExtAction(ISD::EXTLOAD, VT, InnerVT, Expand);			setLoadExtAction(ISD::EXTLOAD, VT, InnerVT, Expand);
	}			}

	▲ Show 20 Lines • Show All 161 Lines • ▼ Show 20 Lines
	setOperationAction(ISD::CTTZ_ZERO_UNDEF, MVT::v2i64, Custom);			setOperationAction(ISD::CTTZ_ZERO_UNDEF, MVT::v2i64, Custom);

	// NEON only has FMA instructions as of VFP4.			// NEON only has FMA instructions as of VFP4.
	if (!Subtarget->hasVFP4()) {			if (!Subtarget->hasVFP4()) {
	setOperationAction(ISD::FMA, MVT::v2f32, Expand);			setOperationAction(ISD::FMA, MVT::v2f32, Expand);
	setOperationAction(ISD::FMA, MVT::v4f32, Expand);			setOperationAction(ISD::FMA, MVT::v4f32, Expand);
	}			}

				setTargetDAGCombine(ISD::BRCOND);
	setTargetDAGCombine(ISD::INTRINSIC_VOID);			setTargetDAGCombine(ISD::INTRINSIC_VOID);
	setTargetDAGCombine(ISD::INTRINSIC_W_CHAIN);			setTargetDAGCombine(ISD::INTRINSIC_W_CHAIN);
	setTargetDAGCombine(ISD::INTRINSIC_WO_CHAIN);			setTargetDAGCombine(ISD::INTRINSIC_WO_CHAIN);
	setTargetDAGCombine(ISD::SHL);			setTargetDAGCombine(ISD::SHL);
	setTargetDAGCombine(ISD::SRL);			setTargetDAGCombine(ISD::SRL);
	setTargetDAGCombine(ISD::SRA);			setTargetDAGCombine(ISD::SRA);
	setTargetDAGCombine(ISD::SIGN_EXTEND);			setTargetDAGCombine(ISD::SIGN_EXTEND);
	setTargetDAGCombine(ISD::ZERO_EXTEND);			setTargetDAGCombine(ISD::ZERO_EXTEND);
	▲ Show 20 Lines • Show All 1,984 Lines • ▼ Show 20 Lines
	V = DAG.getNode(ARMISD::BFI, dl, VT, V, X,			V = DAG.getNode(ARMISD::BFI, dl, VT, V, X,
	// Confusingly, the operand is an inverted mask.			// Confusingly, the operand is an inverted mask.
	DAG.getConstant(~Mask, dl, VT));			DAG.getConstant(~Mask, dl, VT));
	}			}

	return V;			return V;
	}			}

				static SDValue PerformHWLoopCombine(SDNode *N,
				TargetLowering::DAGCombinerInfo &DCI,
				const ARMSubtarget *ST) {
				SDValue CC = N->getOperand(1);

				if (CC->getOperand(0)->getOpcode() != ISD::INTRINSIC_W_CHAIN)
				return SDValue();

				SDValue Int = CC->getOperand(0);
				unsigned IntOp = cast<ConstantSDNode>(Int.getOperand(1))->getZExtValue();
				if (IntOp != Intrinsic::test_set_loop_elements)
				return SDValue();

				SDValue Chain = N->getOperand(0);
				SDValue Elements = Int.getOperand(2);
				SDValue Size = Int.getOperand(3);
				SDValue ExitBlock = N->getOperand(2);
				SDLoc dl(Int);

				SDValue Ops[] = { Chain, Size, Elements, ExitBlock };
				SDValue Res = DCI.DAG.getNode(ARMISD::WhileLoopStart, dl, MVT::Other, Ops);
				DCI.DAG.ReplaceAllUsesOfValueWith(Int.getValue(1), Int.getOperand(0));
				return Res;
				}

	/// PerformBRCONDCombine - Target-specific DAG combining for ARMISD::BRCOND.			/// PerformBRCONDCombine - Target-specific DAG combining for ARMISD::BRCOND.
	SDValue			SDValue
	ARMTargetLowering::PerformBRCONDCombine(SDNode *N, SelectionDAG &DAG) const {			ARMTargetLowering::PerformBRCONDCombine(SDNode *N, SelectionDAG &DAG) const {
	SDValue Cmp = N->getOperand(4);			SDValue Cmp = N->getOperand(4);
	if (Cmp.getOpcode() != ARMISD::CMPZ)			if (Cmp.getOpcode() != ARMISD::CMPZ)
	// Only looking at NE cases.			// Only looking at NE cases.
	return SDValue();			return SDValue();

	EVT VT = N->getValueType(0);			EVT VT = N->getValueType(0);
	SDLoc dl(N);			SDLoc dl(N);
	SDValue LHS = Cmp.getOperand(0);			SDValue LHS = Cmp.getOperand(0);
	SDValue RHS = Cmp.getOperand(1);			SDValue RHS = Cmp.getOperand(1);
	SDValue Chain = N->getOperand(0);			SDValue Chain = N->getOperand(0);
	SDValue BB = N->getOperand(1);			SDValue BB = N->getOperand(1);

	SDValue ARMcc = N->getOperand(2);			SDValue ARMcc = N->getOperand(2);
	ARMCC::CondCodes CC =			ARMCC::CondCodes CC =
	(ARMCC::CondCodes)cast<ConstantSDNode>(ARMcc)->getZExtValue();			(ARMCC::CondCodes)cast<ConstantSDNode>(ARMcc)->getZExtValue();

	// (brcond Chain BB ne CPSR (cmpz (and (cmov 0 1 CC CPSR Cmp) 1) 0))			// (brcond Chain BB ne CPSR (cmpz (and (cmov 0 1 CC CPSR Cmp) 1) 0))
	// -> (brcond Chain BB CC CPSR Cmp)			// -> (brcond Chain BB CC CPSR Cmp)
	if (CC == ARMCC::NE && LHS.getOpcode() == ISD::AND && LHS->hasOneUse() &&			if (CC == ARMCC::NE && LHS.getOpcode() == ISD::AND && LHS->hasOneUse() &&
	LHS->getOperand(0)->getOpcode() == ARMISD::CMOV &&			LHS->getOperand(0)->getOpcode() == ARMISD::CMOV &&
	[... 201 lines not shown ...]
	case ARMISD::ADDE: return PerformADDECombine(N, DCI, Subtarget);			case ARMISD::ADDE: return PerformADDECombine(N, DCI, Subtarget);
	case ARMISD::UMLAL: return PerformUMLALCombine(N, DCI.DAG, Subtarget);			case ARMISD::UMLAL: return PerformUMLALCombine(N, DCI.DAG, Subtarget);
	case ISD::ADD: return PerformADDCombine(N, DCI, Subtarget);			case ISD::ADD: return PerformADDCombine(N, DCI, Subtarget);
	case ISD::SUB: return PerformSUBCombine(N, DCI);			case ISD::SUB: return PerformSUBCombine(N, DCI);
	case ISD::MUL: return PerformMULCombine(N, DCI, Subtarget);			case ISD::MUL: return PerformMULCombine(N, DCI, Subtarget);
	case ISD::OR: return PerformORCombine(N, DCI, Subtarget);			case ISD::OR: return PerformORCombine(N, DCI, Subtarget);
	case ISD::XOR: return PerformXORCombine(N, DCI, Subtarget);			case ISD::XOR: return PerformXORCombine(N, DCI, Subtarget);
	case ISD::AND: return PerformANDCombine(N, DCI, Subtarget);			case ISD::AND: return PerformANDCombine(N, DCI, Subtarget);
				case ISD::BRCOND: return PerformHWLoopCombine(N, DCI, Subtarget);
	case ARMISD::ADDC:			case ARMISD::ADDC:
	case ARMISD::SUBC: return PerformAddcSubcCombine(N, DCI, Subtarget);			case ARMISD::SUBC: return PerformAddcSubcCombine(N, DCI, Subtarget);
	case ARMISD::SUBE: return PerformAddeSubeCombine(N, DCI, Subtarget);			case ARMISD::SUBE: return PerformAddeSubeCombine(N, DCI, Subtarget);
	case ARMISD::BFI: return PerformBFICombine(N, DCI);			case ARMISD::BFI: return PerformBFICombine(N, DCI);
	case ARMISD::VMOVRRD: return PerformVMOVRRDCombine(N, DCI, Subtarget);			case ARMISD::VMOVRRD: return PerformVMOVRRDCombine(N, DCI, Subtarget);
	case ARMISD::VMOVDRR: return PerformVMOVDRRCombine(N, DCI.DAG);			case ARMISD::VMOVDRR: return PerformVMOVDRRCombine(N, DCI.DAG);
	case ISD::STORE: return PerformSTORECombine(N, DCI);			case ISD::STORE: return PerformSTORECombine(N, DCI);
	case ISD::BUILD_VECTOR: return PerformBUILD_VECTORCombine(N, DCI, Subtarget);			case ISD::BUILD_VECTOR: return PerformBUILD_VECTORCombine(N, DCI, Subtarget);
	[... 992 lines not shown ...]

lib/Target/ARM/ARMInstrInfo.td

	//===- ARMInstrInfo.td - Target Description for ARM Target -- tablegen --===//			//===- ARMInstrInfo.td - Target Description for ARM Target -- tablegen --===//
	//			//
	// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.			// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
	// See https://llvm.org/LICENSE.txt for license information.			// See https://llvm.org/LICENSE.txt for license information.
	// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception			// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
	//			//
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
	//			//
	// This file describes the ARM instructions in TableGen format.			// This file describes the ARM instructions in TableGen format.
	//			//
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//

	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
	// ARM specific DAG Nodes.			// ARM specific DAG Nodes.
	//			//
				def SDT_ARMWhileLoop : SDTypeProfile<0, 3, [SDTCisVT<0, i32>,
				SDTCisVT<1, i32>,
				SDTCisVT<2, OtherVT>]>;

				def ARMWLS : SDNode<"ARMISD::WhileLoopStart", SDT_ARMWhileLoop,
				[SDNPHasChain]>;

	// Type profiles.			// Type profiles.
	def SDT_ARMCallSeqStart : SDCallSeqStart<[ SDTCisVT<0, i32>,			def SDT_ARMCallSeqStart : SDCallSeqStart<[ SDTCisVT<0, i32>,
	SDTCisVT<1, i32> ]>;			SDTCisVT<1, i32> ]>;
	def SDT_ARMCallSeqEnd : SDCallSeqEnd<[ SDTCisVT<0, i32>, SDTCisVT<1, i32> ]>;			def SDT_ARMCallSeqEnd : SDCallSeqEnd<[ SDTCisVT<0, i32>, SDTCisVT<1, i32> ]>;
	def SDT_ARMStructByVal : SDTypeProfile<0, 4,			def SDT_ARMStructByVal : SDTypeProfile<0, 4,
	[SDTCisVT<0, i32>, SDTCisVT<1, i32>,			[SDTCisVT<0, i32>, SDTCisVT<1, i32>,
	SDTCisVT<2, i32>, SDTCisVT<3, i32>]>;			SDTCisVT<2, i32>, SDTCisVT<3, i32>]>;
	[... 992 lines not shown ...]

lib/Target/ARM/ARMInstrThumb2.td

	[... 992 lines not shown ...]
	def t2LEApcrel : t2PseudoInst<(outs rGPR:$Rd), (ins i32imm:$label, pred:$p),			def t2LEApcrel : t2PseudoInst<(outs rGPR:$Rd), (ins i32imm:$label, pred:$p),
	4, IIC_iALUi, []>, Sched<[WriteALU, ReadALU]>;			4, IIC_iALUi, []>, Sched<[WriteALU, ReadALU]>;
	let hasSideEffects = 1 in			let hasSideEffects = 1 in
	def t2LEApcrelJT : t2PseudoInst<(outs rGPR:$Rd),			def t2LEApcrelJT : t2PseudoInst<(outs rGPR:$Rd),
	(ins i32imm:$label, pred:$p),			(ins i32imm:$label, pred:$p),
	4, IIC_iALUi,			4, IIC_iALUi,
	[]>, Sched<[WriteALU, ReadALU]>;			[]>, Sched<[WriteALU, ReadALU]>;

				let isBranch = 1, isTerminator = 1, hasSideEffects = 1 in {
				def t2LoopStart :
				t2PseudoInst<(outs),
				(ins imm0_7:$size, rGPR:$elts, brtarget:$target),
				4, IIC_Br, []>, Sched<[WriteBr]>;
				def t2WLSTP :
				T2I<(outs GPRlr:$Rm), (ins imm0_7:$size, GPRlr:$elts, brtarget:$target), IIC_Br,
				"wlstp.$size", "\t$Rm, $elts, $target", []>, Sched<[WriteBr]> {
				bits<5> Rm;
				bits<2> size;
				bits<5> elts;
				bits<12> target;
				}
				}

				def t2LoopDec :
				t2PseudoInst<(outs GPRlr:$Rm),
				(ins GPRlr:$Rn, imm0_7:$size),
				4, IIC_Br,
				[]>,
				Sched<[WriteBr]>;

				let isBranch = 1, isTerminator = 1, hasSideEffects = 1 in {
				def t2LoopEnd :
				t2PseudoInst<(outs),
				(ins GPRlr:$elts, brtarget:$target),
				4, IIC_Br, []>, Sched<[WriteBr]>;
				def t2LETP :
				T2I<(outs GPRlr:$Rm), (ins GPRlr:$elts, brtarget:$target), IIC_Br,
				"letp", "\t$target", []>, Sched<[WriteBr]> {
				bits<5> Rm;
				bits<5> elts;
				bits<12> target;
				}
				}

				def t2ActiveMask :
				t2PseudoInst<(outs VCCR:$pred),
				(ins rGPR:$elts),
				4, IIC_Br,
				[(set VCCR:$pred, (int_get_active_mask_4 rGPR:$elts))]>,
				Sched<[WriteBr]>;
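
As a reading aid for the loop pseudos above, here is a minimal scalar sketch of a tail-predicated while loop. It assumes the semantics described in the RFC summary: the loop start sets a counter to the number of elements and skips the body when that count is zero, the active-mask pseudo enables only the lanes that still have elements, and the loop end subtracts the vector width and branches back while elements remain. The names `activeMask4` and `addOneTailPredicated` are hypothetical and do not appear in the patch; the width of 4 mirrors the `v4i32` patterns.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the v4i1 predicate: lane i is active when
// i < remaining.
std::array<bool, 4> activeMask4(uint32_t remaining) {
  std::array<bool, 4> mask{};
  for (uint32_t lane = 0; lane < 4; ++lane)
    mask[lane] = lane < remaining;
  return mask;
}

// Scalar model of a tail-predicated while loop that adds 1 to every element.
// Returns the number of times the loop body ran.
uint32_t addOneTailPredicated(std::vector<int32_t> &data) {
  // Loop start: the counter holds the element count, not the trip count.
  uint32_t remaining = static_cast<uint32_t>(data.size());
  uint32_t iterations = 0;
  std::size_t base = 0;
  // A zero element count skips the body entirely.
  while (remaining != 0) {
    const std::array<bool, 4> mask = activeMask4(remaining);
    for (std::size_t lane = 0; lane < 4; ++lane) {
      if (!mask[lane])
        continue; // inactive lanes perform no memory access
      assert(base + lane < data.size());
      data[base + lane] += 1; // masked load, add, masked store
    }
    base += 4;
    // Loop end: subtract the vector width, saturating at zero.
    remaining = remaining > 4 ? remaining - 4 : 0;
    ++iterations;
  }
  return iterations;
}
```

With 10 elements the body runs three times, and the last iteration has only two active lanes, so no scalar epilogue loop is needed.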

				def nonext_masked_load :
				PatFrag<(ops node:$ptr, node:$pred, node:$def),
				(masked_load node:$ptr, node:$pred, node:$def), [{
				return cast<MaskedLoadSDNode>(N)->getExtensionType() == ISD::NON_EXTLOAD;
				}]>;
				def nontrunc_masked_store :
				PatFrag<(ops node:$val, node:$ptr, node:$pred),
				(masked_store node:$val, node:$ptr, node:$pred), [{
				return !cast<MaskedStoreSDNode>(N)->isTruncatingStore();
				}]>;

				def VMLDR32 : t2PseudoInst<(outs QPR:$vec),
				(ins t2addrmode_imm12:$addr, VCCR:$pred, i32imm:$imm), 4,
				NoItinerary, []>, Sched<[WriteLd]>;
				let mayLoad = 1 in
				def t2VLDRW : T2I<(outs QPR:$Rm),
				(ins rGPR:$addr), NoItinerary,
				"vldrw", "\t$Rm, [$addr]", []>, Sched<[WriteLd]> {
				bits<6> Rm;
				bits<5> addr;
				}

				def VMSTR32 : t2PseudoInst<(outs),
				(ins QPR:$vec, t2addrmode_imm12:$addr, VCCR:$pred, i32imm:$imm), 4,
				NoItinerary, []>, Sched<[WriteST]>;
				let mayStore = 1 in
				def t2VSTRW : T2I<(outs),
				(ins QPR:$Rm, rGPR:$addr), NoItinerary,
				"vstrw", "\t$Rm, [$addr]", []>, Sched<[WriteST]> {
				bits<6> Rm;
				bits<5> addr;
				}

				def : Pat<(v4i32 (nonext_masked_load rGPR:$addr, (v4i1 VCCR:$pred), undef)),
				(v4i32 (VMLDR32 rGPR:$addr, (i32 0), (v4i1 VCCR:$pred), (i32 2)))>;
				def : Pat<(nontrunc_masked_store (v4i32 QPR:$vec), rGPR:$addr, (v4i1 VCCR:$pred)),
				(VMSTR32 (v4i32 QPR:$vec), rGPR:$addr, (i32 0), (v4i1 VCCR:$pred),
				(i32 2))>;



	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
	// Load / store Instructions.			// Load / store Instructions.
	//			//

	// Load			// Load
	let canFoldAsLoad = 1, isReMaterializable = 1 in			let canFoldAsLoad = 1, isReMaterializable = 1 in
	defm t2LDR : T2I_ld<0, 0b10, "ldr", IIC_iLoad_i, IIC_iLoad_si, GPR, load>;			defm t2LDR : T2I_ld<0, 0b10, "ldr", IIC_iLoad_i, IIC_iLoad_si, GPR, load>;
	[... 992 lines not shown ...]

lib/Target/ARM/ARMRegisterInfo.td

	[... 248 lines not shown ...]
	// implied SP argument list.			// implied SP argument list.
	// FIXME: It would be better to not use this at all and refactor the			// FIXME: It would be better to not use this at all and refactor the
	// instructions to not have SP as an explicit argument. That makes			// instructions to not have SP as an explicit argument. That makes
	// frame index resolution a bit trickier, though.			// frame index resolution a bit trickier, though.
	def GPRsp : RegisterClass<"ARM", [i32], 32, (add SP)> {			def GPRsp : RegisterClass<"ARM", [i32], 32, (add SP)> {
	let DiagnosticString = "operand must be a register sp";			let DiagnosticString = "operand must be a register sp";
	}			}

				def GPRlr : RegisterClass<"ARM", [i32], 32, (add LR)>;

				def VPR : ARMReg<32, "vpr">;
				def VCCR : RegisterClass<"ARM", [i32, v16i1, v8i1, v4i1], 32, (add VPR)>;

	// restricted GPR register class. Many Thumb2 instructions allow the full			// restricted GPR register class. Many Thumb2 instructions allow the full
	// register range for operands, but have undefined behaviours when PC			// register range for operands, but have undefined behaviours when PC
	// or SP (R13 or R15) are used. The ARM ISA refers to these operands			// or SP (R13 or R15) are used. The ARM ISA refers to these operands
	// via the BadReg() pseudo-code description.			// via the BadReg() pseudo-code description.
	def rGPR : RegisterClass<"ARM", [i32], 32, (sub GPR, SP, PC)> {			def rGPR : RegisterClass<"ARM", [i32], 32, (sub GPR, SP, PC)> {
	let AltOrders = [(add LR, rGPR), (trunc rGPR, 8)];			let AltOrders = [(add LR, rGPR), (trunc rGPR, 8)];
	let AltOrderSelect = [{			let AltOrderSelect = [{
	return 1 + MF.getSubtarget<ARMSubtarget>().isThumb1Only();			return 1 + MF.getSubtarget<ARMSubtarget>().isThumb1Only();
	[... 224 lines not shown ...]

lib/Target/ARM/ARMTargetMachine.cpp

[... 404 lines not shown ...]	if (TM->getOptLevel() != CodeGenOpt::None && EnableAtomicTidy)
addPass(createCFGSimplificationPass(		addPass(createCFGSimplificationPass(
1, false, false, true, true, [this](const Function &F) {		1, false, false, true, true, [this](const Function &F) {
const auto &ST = this->TM->getSubtarget<ARMSubtarget>(F);		const auto &ST = this->TM->getSubtarget<ARMSubtarget>(F);
return ST.hasAnyDataBarrier() && !ST.isThumb1Only();		return ST.hasAnyDataBarrier() && !ST.isThumb1Only();
}));		}));

TargetPassConfig::addIRPasses();		TargetPassConfig::addIRPasses();

		addPass(createHardwareLoops());
		addPass(createDeadCodeEliminationPass());

// Run the parallel DSP pass.		// Run the parallel DSP pass.
if (getOptLevel() == CodeGenOpt::Aggressive)		if (getOptLevel() == CodeGenOpt::Aggressive)
addPass(createARMParallelDSPPass());		addPass(createARMParallelDSPPass());

// Match interleaved memory accesses to ldN/stN intrinsics.		// Match interleaved memory accesses to ldN/stN intrinsics.
if (TM->getOptLevel() != CodeGenOpt::None)		if (TM->getOptLevel() != CodeGenOpt::None)
addPass(createInterleavedAccessPass());		addPass(createInterleavedAccessPass());
}		}
[... 68 lines not shown ...]	void ARMPassConfig::addPreSched2() {
if (getOptLevel() != CodeGenOpt::None) {		if (getOptLevel() != CodeGenOpt::None) {
if (EnableARMLoadStoreOpt)		if (EnableARMLoadStoreOpt)
addPass(createARMLoadStoreOptimizationPass());		addPass(createARMLoadStoreOptimizationPass());

addPass(new ARMExecutionDomainFix());		addPass(new ARMExecutionDomainFix());
addPass(createBreakFalseDeps());		addPass(createBreakFalseDeps());
}		}

		addPass(createARMFinaliseHardwareLoopsPass());

// Expand some pseudo instructions into multiple instructions to allow		// Expand some pseudo instructions into multiple instructions to allow
// proper scheduling.		// proper scheduling.
addPass(createARMExpandPseudoPass());		addPass(createARMExpandPseudoPass());

if (getOptLevel() != CodeGenOpt::None) {		if (getOptLevel() != CodeGenOpt::None) {
// in v8, IfConversion depends on Thumb instruction widths		// in v8, IfConversion depends on Thumb instruction widths
addPass(createThumb2SizeReductionPass([this](const Function &F) {		addPass(createThumb2SizeReductionPass([this](const Function &F) {
return this->TM->getSubtarget<ARMSubtarget>(F).restrictIT();		return this->TM->getSubtarget<ARMSubtarget>(F).restrictIT();
[... 23 lines not shown ...]

lib/Target/ARM/ARMTargetTransformInfo.h

[... 174 lines not shown ...]	int getMemoryOpCost(unsigned Opcode, Type *Src, unsigned Alignment,
unsigned AddressSpace, const Instruction *I = nullptr);		unsigned AddressSpace, const Instruction *I = nullptr);

int getInterleavedMemoryOpCost(unsigned Opcode, Type *VecTy, unsigned Factor,		int getInterleavedMemoryOpCost(unsigned Opcode, Type *VecTy, unsigned Factor,
ArrayRef<unsigned> Indices, unsigned Alignment,		ArrayRef<unsigned> Indices, unsigned Alignment,
unsigned AddressSpace,		unsigned AddressSpace,
bool UseMaskForCond = false,		bool UseMaskForCond = false,
bool UseMaskForGaps = false);		bool UseMaskForGaps = false);

		bool isLegalMaskedStore(Type *Ty) { return true; }

		bool isLegalMaskedLoad(Type *Ty) { return true; }

		bool isHardwareLoopProfitable(Loop *L, ScalarEvolution &SE,
		AssumptionCache &AC,
		TargetLibraryInfo *LibInfo,
		TTI::HardwareLoopInfo &HWLoopInfo);

void getUnrollingPreferences(Loop *L, ScalarEvolution &SE,		void getUnrollingPreferences(Loop *L, ScalarEvolution &SE,
TTI::UnrollingPreferences &UP);		TTI::UnrollingPreferences &UP);

bool shouldBuildLookupTablesForConstant(Constant *C) const {		bool shouldBuildLookupTablesForConstant(Constant *C) const {
// In the ROPI and RWPI relocation models we can't have pointers to global		// In the ROPI and RWPI relocation models we can't have pointers to global
// variables or functions in constant data, so don't convert switches to		// variables or functions in constant data, so don't convert switches to
// lookup tables if any of the values would need relocation.		// lookup tables if any of the values would need relocation.
if (ST->isROPI() || ST->isRWPI())		if (ST->isROPI() || ST->isRWPI())
[... 10 lines not shown ...]

lib/Target/ARM/ARMTargetTransformInfo.cpp

[... 622 lines not shown ...]	if (NumElts % Factor == 0 &&
return Factor * TLI->getNumInterleavedAccesses(SubVecTy, DL);		return Factor * TLI->getNumInterleavedAccesses(SubVecTy, DL);
}		}

return BaseT::getInterleavedMemoryOpCost(Opcode, VecTy, Factor, Indices,		return BaseT::getInterleavedMemoryOpCost(Opcode, VecTy, Factor, Indices,
Alignment, AddressSpace,		Alignment, AddressSpace,
UseMaskForCond, UseMaskForGaps);		UseMaskForCond, UseMaskForGaps);
}		}

		bool ARMTTIImpl::isHardwareLoopProfitable(Loop *L, ScalarEvolution &SE,
		AssumptionCache &AC,
		TargetLibraryInfo *LibInfo,
		TTI::HardwareLoopInfo &HWLoopInfo) {
		if (!L->getExitBlock() || !SE.getBackedgeTakenCount(L))
		return false;

		const SCEV *BackedgeTakenCount = SE.getBackedgeTakenCount(L);
		if (isa<SCEVCouldNotCompute>(BackedgeTakenCount))
		return false;

		const SCEV *TripCountSCEV =
		SE.getAddExpr(BackedgeTakenCount,
		SE.getOne(BackedgeTakenCount->getType()));

		if (SE.getUnsignedRangeMax(TripCountSCEV).getBitWidth() > 32)
		return false;

		auto CheckForPredicates = [&HWLoopInfo](Loop *L) {
		VectorType *VecTy = nullptr;
		// Inspect the instructions for vector operations.
		for (auto *BB : L->getBlocks()) {
		for (auto &I : *BB) {
		if (!isa<VectorType>(I.getType()))
		continue;

		auto *VTy = cast<VectorType>(I.getType());
		if (!VecTy)
		VecTy = VTy;
		else if (VecTy->getNumElements() != VTy->getNumElements())
		return false;

		if (!isa<IntrinsicInst>(&I))
		continue;

		auto *Call = dyn_cast<IntrinsicInst>(&I);
		if (Call->getIntrinsicID() != Intrinsic::masked_load &&
		Call->getIntrinsicID() != Intrinsic::masked_store)
		continue;

		if (!HWLoopInfo.Predicate)
		HWLoopInfo.Predicate = cast<Instruction>(Call->getOperand(2));
		else if (HWLoopInfo.Predicate != cast<Instruction>(Call->getOperand(2)))
		return false;
		}
		}
		return true;
		};

		if (!CheckForPredicates(L))
		return false;

		BasicBlock *Preheader = L->getLoopPreheader();
		if (auto *BI = dyn_cast<BranchInst>(Preheader->getTerminator()))
		if (BI->isUnconditional() && Preheader->getUniquePredecessor())
		HWLoopInfo.PerformTest = true;

		LLVMContext &C = L->getHeader()->getParent()->getParent()->getContext();
		HWLoopInfo.InsertPHICounter = true;
		HWLoopInfo.CountType = Type::getInt32Ty(C);
		HWLoopInfo.NumElements = 4;
		return true;
		}
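
The consistency check performed by the `CheckForPredicates` lambda above can be sketched outside LLVM as follows. This is a simplified model, not the patch's code: `InstModel`, `PredicateCheck` and `checkForPredicates` are hypothetical names, and each instruction is reduced to a lane count and an optional mask identity. Note that the model gives every masked memory operation a lane count, whereas in IR a masked store produces no value.

```cpp
#include <cassert>
#include <vector>

// Hypothetical, simplified stand-in for an IR instruction in the loop body.
struct InstModel {
  unsigned numLanes;  // 0 for a scalar result, otherwise the element count
  bool isMaskedMemOp; // true for a masked load or masked store
  int predicateId;    // identity of the mask; only read when isMaskedMemOp
};

struct PredicateCheck {
  bool consistent;   // false if vector widths or predicates disagree
  bool hasPredicate; // true once a masked memory operation was seen
  int predicateId;   // valid only when hasPredicate is true
};

PredicateCheck checkForPredicates(const std::vector<InstModel> &body) {
  PredicateCheck result = {true, false, 0};
  unsigned lanes = 0;
  for (const InstModel &inst : body) {
    if (inst.numLanes == 0)
      continue; // scalar instructions do not constrain the loop
    if (lanes == 0) {
      lanes = inst.numLanes; // the first vector instruction fixes the width
    } else if (lanes != inst.numLanes) {
      result.consistent = false; // mixed vector widths
      return result;
    }
    if (!inst.isMaskedMemOp)
      continue;
    if (!result.hasPredicate) {
      result.hasPredicate = true;
      result.predicateId = inst.predicateId;
    } else if (result.predicateId != inst.predicateId) {
      result.consistent = false; // two different masks in one loop
      return result;
    }
  }
  return result;
}

// Small usage example: one shared mask (id 7) across a 4-lane loop body.
void demoCheckForPredicates() {
  const std::vector<InstModel> body = {
      {0, false, 0}, {4, true, 7}, {4, false, 0}, {4, true, 7}};
  const PredicateCheck r = checkForPredicates(body);
  assert(r.consistent && r.hasPredicate && r.predicateId == 7);
}
```

A loop with no masked memory operations still passes the check; it simply reports that no predicate was found.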

void ARMTTIImpl::getUnrollingPreferences(Loop *L, ScalarEvolution &SE,		void ARMTTIImpl::getUnrollingPreferences(Loop *L, ScalarEvolution &SE,
TTI::UnrollingPreferences &UP) {		TTI::UnrollingPreferences &UP) {
// Only currently enable these preferences for M-Class cores.		// Only currently enable these preferences for M-Class cores.
if (!ST->isMClass())		if (!ST->isMClass())
return BasicTTIImplBase::getUnrollingPreferences(L, SE, UP);		return BasicTTIImplBase::getUnrollingPreferences(L, SE, UP);

// Disable loop unrolling for Oz and Os.		// Disable loop unrolling for Oz and Os.
UP.OptSizeThreshold = 0;		UP.OptSizeThreshold = 0;
[... 57 lines not shown ...]

lib/Target/ARM/CMakeLists.txt

[... 23 lines not shown ...]	add_llvm_target(ARMCodeGen
ARMBaseRegisterInfo.cpp		ARMBaseRegisterInfo.cpp
ARMCallingConv.cpp		ARMCallingConv.cpp
ARMCallLowering.cpp		ARMCallLowering.cpp
ARMCodeGenPrepare.cpp		ARMCodeGenPrepare.cpp
ARMConstantIslandPass.cpp		ARMConstantIslandPass.cpp
ARMConstantPoolValue.cpp		ARMConstantPoolValue.cpp
ARMExpandPseudoInsts.cpp		ARMExpandPseudoInsts.cpp
ARMFastISel.cpp		ARMFastISel.cpp
		ARMFinalizeHardwareLoops.cpp
ARMFrameLowering.cpp		ARMFrameLowering.cpp
ARMHazardRecognizer.cpp		ARMHazardRecognizer.cpp
ARMInstructionSelector.cpp		ARMInstructionSelector.cpp
ARMISelDAGToDAG.cpp		ARMISelDAGToDAG.cpp
ARMISelLowering.cpp		ARMISelLowering.cpp
ARMInstrInfo.cpp		ARMInstrInfo.cpp
ARMLegalizerInfo.cpp		ARMLegalizerInfo.cpp
ARMParallelDSP.cpp		ARMParallelDSP.cpp
[... 27 lines not shown ...]

lib/Target/PowerPC/PPCCTRLoops.cpp

[... 65 lines not shown ...]
using namespace llvm;		using namespace llvm;

#define DEBUG_TYPE "ctrloops"		#define DEBUG_TYPE "ctrloops"

#ifndef NDEBUG		#ifndef NDEBUG
static cl::opt<int> CTRLoopLimit("ppc-max-ctrloop", cl::Hidden, cl::init(-1));		static cl::opt<int> CTRLoopLimit("ppc-max-ctrloop", cl::Hidden, cl::init(-1));
#endif		#endif

// The latency of mtctr is only justified if there are more than 4
// comparisons that will be removed as a result.
static cl::opt<unsigned>
SmallCTRLoopThreshold("min-ctr-loop-threshold", cl::init(4), cl::Hidden,
cl::desc("Loops with a constant trip count smaller than "
"this value will not use the count register."));

STATISTIC(NumCTRLoops, "Number of loops converted to CTR loops");

namespace {		namespace {
struct PPCCTRLoops : public FunctionPass {

#ifndef NDEBUG
static int Counter;
#endif

public:
static char ID;

PPCCTRLoops() : FunctionPass(ID) {
initializePPCCTRLoopsPass(*PassRegistry::getPassRegistry());
}

bool runOnFunction(Function &F) override;

void getAnalysisUsage(AnalysisUsage &AU) const override {
AU.addRequired<LoopInfoWrapperPass>();
AU.addPreserved<LoopInfoWrapperPass>();
AU.addRequired<DominatorTreeWrapperPass>();
AU.addPreserved<DominatorTreeWrapperPass>();
AU.addRequired<ScalarEvolutionWrapperPass>();
AU.addRequired<AssumptionCacheTracker>();
AU.addRequired<TargetTransformInfoWrapperPass>();
}

private:
bool mightUseCTR(BasicBlock *BB);
bool convertToCTRLoop(Loop *L);

private:
const PPCTargetMachine *TM;
const PPCSubtarget *STI;
const PPCTargetLowering *TLI;
const DataLayout *DL;
const TargetLibraryInfo *LibInfo;
const TargetTransformInfo *TTI;
LoopInfo *LI;
ScalarEvolution *SE;
DominatorTree *DT;
bool PreserveLCSSA;
TargetSchedModel SchedModel;
};

char PPCCTRLoops::ID = 0;
#ifndef NDEBUG
int PPCCTRLoops::Counter = 0;
#endif

#ifndef NDEBUG		#ifndef NDEBUG
struct PPCCTRLoopsVerify : public MachineFunctionPass {		struct PPCCTRLoopsVerify : public MachineFunctionPass {
public:		public:
static char ID;		static char ID;

PPCCTRLoopsVerify() : MachineFunctionPass(ID) {		PPCCTRLoopsVerify() : MachineFunctionPass(ID) {
initializePPCCTRLoopsVerifyPass(*PassRegistry::getPassRegistry());		initializePPCCTRLoopsVerifyPass(*PassRegistry::getPassRegistry());
[... 9 lines not shown ...]	#ifndef NDEBUG
private:		private:
MachineDominatorTree *MDT;		MachineDominatorTree *MDT;
};		};

char PPCCTRLoopsVerify::ID = 0;		char PPCCTRLoopsVerify::ID = 0;
#endif // NDEBUG		#endif // NDEBUG
} // end anonymous namespace		} // end anonymous namespace

INITIALIZE_PASS_BEGIN(PPCCTRLoops, "ppc-ctr-loops", "PowerPC CTR Loops",
false, false)
INITIALIZE_PASS_DEPENDENCY(DominatorTreeWrapperPass)
INITIALIZE_PASS_DEPENDENCY(LoopInfoWrapperPass)
INITIALIZE_PASS_DEPENDENCY(ScalarEvolutionWrapperPass)
INITIALIZE_PASS_END(PPCCTRLoops, "ppc-ctr-loops", "PowerPC CTR Loops",
false, false)

FunctionPass *llvm::createPPCCTRLoops() { return new PPCCTRLoops(); }

#ifndef NDEBUG		#ifndef NDEBUG
INITIALIZE_PASS_BEGIN(PPCCTRLoopsVerify, "ppc-ctr-loops-verify",		INITIALIZE_PASS_BEGIN(PPCCTRLoopsVerify, "ppc-ctr-loops-verify",
"PowerPC CTR Loops Verify", false, false)		"PowerPC CTR Loops Verify", false, false)
INITIALIZE_PASS_DEPENDENCY(MachineDominatorTree)		INITIALIZE_PASS_DEPENDENCY(MachineDominatorTree)
INITIALIZE_PASS_END(PPCCTRLoopsVerify, "ppc-ctr-loops-verify",		INITIALIZE_PASS_END(PPCCTRLoopsVerify, "ppc-ctr-loops-verify",
"PowerPC CTR Loops Verify", false, false)		"PowerPC CTR Loops Verify", false, false)

FunctionPass *llvm::createPPCCTRLoopsVerify() {		FunctionPass *llvm::createPPCCTRLoopsVerify() {
return new PPCCTRLoopsVerify();		return new PPCCTRLoopsVerify();
}		}
#endif // NDEBUG		#endif // NDEBUG

bool PPCCTRLoops::runOnFunction(Function &F) {
if (skipFunction(F))
return false;

auto *TPC = getAnalysisIfAvailable<TargetPassConfig>();
if (!TPC)
return false;

TM = &TPC->getTM<PPCTargetMachine>();
STI = TM->getSubtargetImpl(F);
TLI = STI->getTargetLowering();

LI = &getAnalysis<LoopInfoWrapperPass>().getLoopInfo();
SE = &getAnalysis<ScalarEvolutionWrapperPass>().getSE();
DT = &getAnalysis<DominatorTreeWrapperPass>().getDomTree();
TTI = &getAnalysis<TargetTransformInfoWrapperPass>().getTTI(F);
DL = &F.getParent()->getDataLayout();
auto *TLIP = getAnalysisIfAvailable<TargetLibraryInfoWrapperPass>();
LibInfo = TLIP ? &TLIP->getTLI() : nullptr;
PreserveLCSSA = mustPreserveAnalysisID(LCSSAID);
SchedModel.init(STI);

bool MadeChange = false;

for (LoopInfo::iterator I = LI->begin(), E = LI->end();
I != E; ++I) {
Loop *L = *I;
if (!L->getParentLoop())
MadeChange |= convertToCTRLoop(L);
}

return MadeChange;
}

static bool isLargeIntegerTy(bool Is32Bit, Type *Ty) {
if (IntegerType *ITy = dyn_cast<IntegerType>(Ty))
return ITy->getBitWidth() > (Is32Bit ? 32U : 64U);

return false;
}

// Determining the address of a TLS variable results in a function call in
// certain TLS models.
static bool memAddrUsesCTR(const PPCTargetMachine &TM, const Value *MemAddr) {
const auto *GV = dyn_cast<GlobalValue>(MemAddr);
if (!GV) {
// Recurse to check for constants that refer to TLS global variables.
if (const auto *CV = dyn_cast<Constant>(MemAddr))
for (const auto &CO : CV->operands())
if (memAddrUsesCTR(TM, CO))
return true;

return false;
}

if (!GV->isThreadLocal())
return false;
TLSModel::Model Model = TM.getTLSModel(GV);
return Model == TLSModel::GeneralDynamic || Model == TLSModel::LocalDynamic;
}

// Loop through the inline asm constraints and look for something that clobbers
// ctr.
static bool asmClobbersCTR(InlineAsm *IA) {
InlineAsm::ConstraintInfoVector CIV = IA->ParseConstraints();
for (unsigned i = 0, ie = CIV.size(); i < ie; ++i) {
InlineAsm::ConstraintInfo &C = CIV[i];
if (C.Type != InlineAsm::isInput)
for (unsigned j = 0, je = C.Codes.size(); j < je; ++j)
if (StringRef(C.Codes[j]).equals_lower("{ctr}"))
return true;
}
return false;
}
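
For readers without an LLVM checkout, the scan in `asmClobbersCTR` above can be modelled in plain C++. This is a sketch under assumptions: `ConstraintModel`, `ConstraintKind`, `equalsIgnoreCase` and `clobbersCounter` are hypothetical names standing in for the parsed inline-asm constraint data, and only the logic (skip inputs, compare each code to `{ctr}` ignoring case) is taken from the function above.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical model of one parsed inline-asm constraint.
enum class ConstraintKind { Input, Output, Clobber };

struct ConstraintModel {
  ConstraintKind kind;
  std::vector<std::string> codes; // e.g. "r", "{ctr}"
};

// Case-insensitive string equality.
static bool equalsIgnoreCase(const std::string &a, const std::string &b) {
  return a.size() == b.size() &&
         std::equal(a.begin(), a.end(), b.begin(), [](char x, char y) {
           return std::tolower(static_cast<unsigned char>(x)) ==
                  std::tolower(static_cast<unsigned char>(y));
         });
}

// True if any non-input constraint names the counter register.
bool clobbersCounter(const std::vector<ConstraintModel> &constraints) {
  for (const ConstraintModel &c : constraints) {
    if (c.kind == ConstraintKind::Input)
      continue; // reading the register does not disturb the loop counter
    for (const std::string &code : c.codes)
      if (equalsIgnoreCase(code, "{ctr}"))
        return true;
  }
  return false;
}

// Small usage example.
void demoClobbersCounter() {
  const std::vector<ConstraintModel> asmConstraints = {
      {ConstraintKind::Output, {"r"}}, {ConstraintKind::Clobber, {"{CTR}"}}};
  assert(clobbersCounter(asmConstraints));
}
```

Both outputs and clobbers count as writes here, which matches the `C.Type != InlineAsm::isInput` test in the original function.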

bool PPCCTRLoops::mightUseCTR(BasicBlock *BB) {
for (BasicBlock::iterator J = BB->begin(), JE = BB->end();
J != JE; ++J) {
if (CallInst *CI = dyn_cast<CallInst>(J)) {
// Inline ASM is okay, unless it clobbers the ctr register.
if (InlineAsm *IA = dyn_cast<InlineAsm>(CI->getCalledValue())) {
if (asmClobbersCTR(IA))
return true;
continue;
}

if (Function *F = CI->getCalledFunction()) {
// Most intrinsics don't become function calls, but some might.
// sin, cos, exp and log are always calls.
unsigned Opcode = 0;
if (F->getIntrinsicID() != Intrinsic::not_intrinsic) {
switch (F->getIntrinsicID()) {
default: continue;
// If we have a call to ppc_is_decremented_ctr_nonzero, or ppc_mtctr
// we're definitely using CTR.
case Intrinsic::ppc_is_decremented_ctr_nonzero:
case Intrinsic::ppc_mtctr:
return true;

// VisualStudio defines setjmp as _setjmp
#if defined(_MSC_VER) && defined(setjmp) && \
!defined(setjmp_undefined_for_msvc)
# pragma push_macro("setjmp")
# undef setjmp
# define setjmp_undefined_for_msvc
#endif

case Intrinsic::setjmp:

#if defined(_MSC_VER) && defined(setjmp_undefined_for_msvc)
// let's return it to _setjmp state
# pragma pop_macro("setjmp")
# undef setjmp_undefined_for_msvc
#endif

case Intrinsic::longjmp:

// Exclude eh_sjlj_setjmp; we don't need to exclude eh_sjlj_longjmp
// because, although it does clobber the counter register, the
// control can't then return to inside the loop unless there is also
// an eh_sjlj_setjmp.
case Intrinsic::eh_sjlj_setjmp:

case Intrinsic::memcpy:
case Intrinsic::memmove:
case Intrinsic::memset:
case Intrinsic::powi:
case Intrinsic::log:
case Intrinsic::log2:
case Intrinsic::log10:
case Intrinsic::exp:
case Intrinsic::exp2:
case Intrinsic::pow:
case Intrinsic::sin:
case Intrinsic::cos:
return true;
case Intrinsic::copysign:
if (CI->getArgOperand(0)->getType()->getScalarType()->
isPPC_FP128Ty())
return true;
else
continue; // ISD::FCOPYSIGN is never a library call.
case Intrinsic::sqrt: Opcode = ISD::FSQRT; break;
case Intrinsic::floor: Opcode = ISD::FFLOOR; break;
case Intrinsic::ceil: Opcode = ISD::FCEIL; break;
case Intrinsic::trunc: Opcode = ISD::FTRUNC; break;
case Intrinsic::rint: Opcode = ISD::FRINT; break;
case Intrinsic::nearbyint: Opcode = ISD::FNEARBYINT; break;
case Intrinsic::round: Opcode = ISD::FROUND; break;
case Intrinsic::minnum: Opcode = ISD::FMINNUM; break;
case Intrinsic::maxnum: Opcode = ISD::FMAXNUM; break;
case Intrinsic::umul_with_overflow: Opcode = ISD::UMULO; break;
case Intrinsic::smul_with_overflow: Opcode = ISD::SMULO; break;
}
}

// PowerPC does not use [US]DIVREM or other library calls for
// operations on regular types which are not otherwise library calls
// (i.e. soft float or atomics). If adapting for targets that do,
// additional care is required here.

LibFunc Func;
if (!F->hasLocalLinkage() && F->hasName() && LibInfo &&
LibInfo->getLibFunc(F->getName(), Func) &&
LibInfo->hasOptimizedCodeGen(Func)) {
// Non-read-only functions are never treated as intrinsics.
if (!CI->onlyReadsMemory())
return true;

// Conversion happens only for FP calls.
if (!CI->getArgOperand(0)->getType()->isFloatingPointTy())
return true;

switch (Func) {
default: return true;
case LibFunc_copysign:
case LibFunc_copysignf:
continue; // ISD::FCOPYSIGN is never a library call.
case LibFunc_copysignl:
return true;
case LibFunc_fabs:
case LibFunc_fabsf:
case LibFunc_fabsl:
continue; // ISD::FABS is never a library call.
case LibFunc_sqrt:
case LibFunc_sqrtf:
case LibFunc_sqrtl:
Opcode = ISD::FSQRT; break;
case LibFunc_floor:
case LibFunc_floorf:
case LibFunc_floorl:
Opcode = ISD::FFLOOR; break;
case LibFunc_nearbyint:
case LibFunc_nearbyintf:
case LibFunc_nearbyintl:
Opcode = ISD::FNEARBYINT; break;
case LibFunc_ceil:
case LibFunc_ceilf:
case LibFunc_ceill:
Opcode = ISD::FCEIL; break;
case LibFunc_rint:
case LibFunc_rintf:
case LibFunc_rintl:
Opcode = ISD::FRINT; break;
case LibFunc_round:
case LibFunc_roundf:
case LibFunc_roundl:
Opcode = ISD::FROUND; break;
case LibFunc_trunc:
case LibFunc_truncf:
case LibFunc_truncl:
Opcode = ISD::FTRUNC; break;
case LibFunc_fmin:
case LibFunc_fminf:
case LibFunc_fminl:
Opcode = ISD::FMINNUM; break;
case LibFunc_fmax:
case LibFunc_fmaxf:
case LibFunc_fmaxl:
Opcode = ISD::FMAXNUM; break;
}
}

if (Opcode) {
EVT EVTy =
TLI->getValueType(*DL, CI->getArgOperand(0)->getType(), true);

if (EVTy == MVT::Other)
return true;

if (TLI->isOperationLegalOrCustom(Opcode, EVTy))
continue;
else if (EVTy.isVector() &&
TLI->isOperationLegalOrCustom(Opcode, EVTy.getScalarType()))
continue;

return true;
}
}

return true;
} else if (isa<BinaryOperator>(J) &&
J->getType()->getScalarType()->isPPC_FP128Ty()) {
// Most operations on ppc_f128 values become calls.
return true;
} else if (isa<UIToFPInst>(J) || isa<SIToFPInst>(J) ||
isa<FPToUIInst>(J) || isa<FPToSIInst>(J)) {
CastInst *CI = cast<CastInst>(J);
if (CI->getSrcTy()->getScalarType()->isPPC_FP128Ty() ||
CI->getDestTy()->getScalarType()->isPPC_FP128Ty() ||
isLargeIntegerTy(!TM->isPPC64(), CI->getSrcTy()->getScalarType()) ||
isLargeIntegerTy(!TM->isPPC64(), CI->getDestTy()->getScalarType()))
return true;
} else if (isLargeIntegerTy(!TM->isPPC64(),
J->getType()->getScalarType()) &&
(J->getOpcode() == Instruction::UDiv ||
J->getOpcode() == Instruction::SDiv ||
J->getOpcode() == Instruction::URem ||
J->getOpcode() == Instruction::SRem)) {
return true;
} else if (!TM->isPPC64() &&
isLargeIntegerTy(false, J->getType()->getScalarType()) &&
(J->getOpcode() == Instruction::Shl ||
J->getOpcode() == Instruction::AShr ||
J->getOpcode() == Instruction::LShr)) {
// Only on PPC32, for 128-bit integers (specifically not 64-bit
// integers), these might be runtime calls.
return true;
} else if (isa<IndirectBrInst>(J) || isa<InvokeInst>(J)) {
// On PowerPC, indirect jumps use the counter register.
return true;
} else if (SwitchInst *SI = dyn_cast<SwitchInst>(J)) {
if (SI->getNumCases() + 1 >= (unsigned)TLI->getMinimumJumpTableEntries())
return true;
}

// FREM is always a call.
if (J->getOpcode() == Instruction::FRem)
return true;

if (STI->useSoftFloat()) {
switch(J->getOpcode()) {
case Instruction::FAdd:
case Instruction::FSub:
case Instruction::FMul:
case Instruction::FDiv:
case Instruction::FPTrunc:
case Instruction::FPExt:
case Instruction::FPToUI:
case Instruction::FPToSI:
case Instruction::UIToFP:
case Instruction::SIToFP:
case Instruction::FCmp:
return true;
}
}

for (Value *Operand : J->operands())
if (memAddrUsesCTR(*TM, Operand))
return true;
}

return false;
}
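
Note on the integer cases in the function above: they reduce to a width test against the native register size. The following is a minimal standalone sketch of that test, not code from the patch. The names `Op`, `isLargeInteger` and `intOpMightBecomeCall` are hypothetical stand-ins for the `Instruction` opcodes and the `isLargeIntegerTy` helper, and the sketch assumes only scalar integer types.

```cpp
// Hypothetical stand-in for the IR opcodes the check inspects.
enum class Op { UDiv, SDiv, URem, SRem, Shl, AShr, LShr, Add };

// Mirrors isLargeIntegerTy: an integer is "large" when it is wider than
// 32 bits (if is32Bit is true) or wider than 64 bits (otherwise).
bool isLargeInteger(bool is32Bit, unsigned bitWidth) {
  return bitWidth > (is32Bit ? 32u : 64u);
}

// Returns true when an integer operation of the given width is treated as a
// possible runtime call, following the two else-if branches in the pass.
bool intOpMightBecomeCall(bool isPPC64, Op op, unsigned bitWidth) {
  const bool isDivRem = op == Op::UDiv || op == Op::SDiv ||
                        op == Op::URem || op == Op::SRem;
  const bool isShift = op == Op::Shl || op == Op::AShr || op == Op::LShr;

  // Division and remainder: wider than the native register on either target.
  if (isDivRem && isLargeInteger(!isPPC64, bitWidth))
    return true;

  // Shifts: only on PPC32, and only for integers wider than 64 bits.
  if (isShift && !isPPC64 && isLargeInteger(false, bitWidth))
    return true;

  return false;
}
```

Under these assumptions a 64-bit `udiv` is flagged on PPC32 but not on PPC64, while a 64-bit `shl` is not flagged on either target.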
bool PPCCTRLoops::convertToCTRLoop(Loop *L) {
bool MadeChange = false;

// Do not convert small short loops to CTR loop.
unsigned ConstTripCount = SE->getSmallConstantTripCount(L);
if (ConstTripCount && ConstTripCount < SmallCTRLoopThreshold) {
SmallPtrSet<const Value *, 32> EphValues;
auto AC = getAnalysis<AssumptionCacheTracker>().getAssumptionCache(
*L->getHeader()->getParent());
CodeMetrics::collectEphemeralValues(L, &AC, EphValues);
CodeMetrics Metrics;
for (BasicBlock *BB : L->blocks())
Metrics.analyzeBasicBlock(BB, *TTI, EphValues);
// 6 is an approximate latency for the mtctr instruction.
if (Metrics.NumInsts <= (6 * SchedModel.getIssueWidth()))
return false;
}

// Process nested loops first.
for (Loop::iterator I = L->begin(), E = L->end(); I != E; ++I) {
MadeChange |= convertToCTRLoop(*I);
LLVM_DEBUG(dbgs() << "Nested loop converted\n");
}

// If a nested loop has been converted, then we can't convert this loop.
if (MadeChange)
return MadeChange;

// Bail out if the loop has irreducible control flow.
LoopBlocksRPO RPOT(L);
RPOT.perform(LI);
if (containsIrreducibleCFG<const BasicBlock *>(RPOT, *LI))
return false;

#ifndef NDEBUG
// Stop trying after reaching the limit (if any).
int Limit = CTRLoopLimit;
if (Limit >= 0) {
if (Counter >= CTRLoopLimit)
return false;
Counter++;
}
#endif

// We don't want to spill/restore the counter register, and so we don't
// want to use the counter register if the loop contains calls.
for (Loop::block_iterator I = L->block_begin(), IE = L->block_end();
I != IE; ++I)
if (mightUseCTR(*I))
return MadeChange;

SmallVector<BasicBlock*, 4> ExitingBlocks;
L->getExitingBlocks(ExitingBlocks);

// If there is an exit edge known to be frequently taken,
// we should not transform this loop.
for (auto &BB : ExitingBlocks) {
Instruction *TI = BB->getTerminator();
if (!TI) continue;

if (BranchInst *BI = dyn_cast<BranchInst>(TI)) {
uint64_t TrueWeight = 0, FalseWeight = 0;
if (!BI->isConditional() ||
!BI->extractProfMetadata(TrueWeight, FalseWeight))
continue;

// If the exit path is more frequent than the loop path,
// we return here without further analysis for this loop.
bool TrueIsExit = !L->contains(BI->getSuccessor(0));
if (( TrueIsExit && FalseWeight < TrueWeight) ||
(!TrueIsExit && FalseWeight > TrueWeight))
return MadeChange;
}
}

BasicBlock *CountedExitBlock = nullptr;
const SCEV *ExitCount = nullptr;
BranchInst *CountedExitBranch = nullptr;
for (SmallVectorImpl<BasicBlock *>::iterator I = ExitingBlocks.begin(),
IE = ExitingBlocks.end(); I != IE; ++I) {
const SCEV *EC = SE->getExitCount(L, *I);
LLVM_DEBUG(dbgs() << "Exit Count for " << *L << " from block "
<< (*I)->getName() << ": " << *EC << "\n");
if (isa<SCEVCouldNotCompute>(EC))
continue;
if (const SCEVConstant *ConstEC = dyn_cast<SCEVConstant>(EC)) {
if (ConstEC->getValue()->isZero())
continue;
} else if (!SE->isLoopInvariant(EC, L))
continue;

if (SE->getTypeSizeInBits(EC->getType()) > (TM->isPPC64() ? 64 : 32))
continue;

// If this exiting block is contained in a nested loop, it is not eligible
// for insertion of the branch-and-decrement since the inner loop would
// end up messing up the value in the CTR.
if (LI->getLoopFor(*I) != L)
continue;

// We now have a loop-invariant count of loop iterations (which is not the
// constant zero) for which we know that this loop will not exit via this
// existing block.

// We need to make sure that this block will run on every loop iteration.
// For this to be true, we must dominate all blocks with backedges. Such
// blocks are in-loop predecessors to the header block.
bool NotAlways = false;
for (pred_iterator PI = pred_begin(L->getHeader()),
PIE = pred_end(L->getHeader()); PI != PIE; ++PI) {
if (!L->contains(*PI))
continue;

if (!DT->dominates(*I, *PI)) {
NotAlways = true;
break;
}
}

if (NotAlways)
continue;

// Make sure this block ends with a conditional branch.
Instruction *TI = (*I)->getTerminator();
if (!TI)
continue;

if (BranchInst *BI = dyn_cast<BranchInst>(TI)) {
if (!BI->isConditional())
continue;

CountedExitBranch = BI;
} else
continue;

// Note that this block may not be the loop latch block, even if the loop
// has a latch block.
CountedExitBlock = *I;
ExitCount = EC;
break;
}

if (!CountedExitBlock)
return MadeChange;

BasicBlock *Preheader = L->getLoopPreheader();

// If we don't have a preheader, then insert one. If we already have a
// preheader, then we can use it (except if the preheader contains a use of
// the CTR register because some such uses might be reordered by the
// selection DAG after the mtctr instruction).
if (!Preheader || mightUseCTR(Preheader))
Preheader = InsertPreheaderForLoop(L, DT, LI, nullptr, PreserveLCSSA);
if (!Preheader)
return MadeChange;

LLVM_DEBUG(dbgs() << "Preheader for exit count: " << Preheader->getName()
<< "\n");

// Insert the count into the preheader and replace the condition used by the
// selected branch.
MadeChange = true;

SCEVExpander SCEVE(SE, DL, "loopcnt");
LLVMContext &C = SE->getContext();
Type *CountType = TM->isPPC64() ? Type::getInt64Ty(C) : Type::getInt32Ty(C);
if (!ExitCount->getType()->isPointerTy() &&
ExitCount->getType() != CountType)
ExitCount = SE->getZeroExtendExpr(ExitCount, CountType);
ExitCount = SE->getAddExpr(ExitCount, SE->getOne(CountType));
Value *ECValue =
SCEVE.expandCodeFor(ExitCount, CountType, Preheader->getTerminator());

IRBuilder<> CountBuilder(Preheader->getTerminator());
Module *M = Preheader->getParent()->getParent();
Function *MTCTRFunc =
Intrinsic::getDeclaration(M, Intrinsic::ppc_mtctr, CountType);
CountBuilder.CreateCall(MTCTRFunc, ECValue);

IRBuilder<> CondBuilder(CountedExitBranch);
Function *DecFunc =
Intrinsic::getDeclaration(M, Intrinsic::ppc_is_decremented_ctr_nonzero);
Value *NewCond = CondBuilder.CreateCall(DecFunc, {});
Value *OldCond = CountedExitBranch->getCondition();
CountedExitBranch->setCondition(NewCond);

// The false branch must exit the loop.
if (!L->contains(CountedExitBranch->getSuccessor(0)))
CountedExitBranch->swapSuccessors();

// The old condition may be dead now, and may have even created a dead PHI
// (the original induction variable).
RecursivelyDeleteTriviallyDeadInstructions(OldCond);
// Run through the basic blocks of the loop and see if any of them have dead
// PHIs that can be removed.
for (auto I : L->blocks())
DeleteDeadPHIs(I);

++NumCTRLoops;
return MadeChange;
}
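
Note on convertToCTRLoop above: the numeric decisions it makes can be restated without any LLVM types. The sketch below is a hedged illustration, not the pass itself. The function names are hypothetical, the default threshold of 4 is taken from the `min-ctr-loop-threshold` option that appears later in this diff, and the factor 6 is the approximate mtctr latency quoted in the source comment.

```cpp
#include <cstdint>

// Small-loop gate: a known, small trip count combined with a small body means
// the mtctr latency is not expected to pay off. A trip count of 0 means
// "unknown" and never triggers the gate.
bool tooSmallForCTR(unsigned constTripCount, unsigned numInsts,
                    unsigned issueWidth, unsigned threshold = 4) {
  if (constTripCount == 0 || constTripCount >= threshold)
    return false;
  return numInsts <= 6 * issueWidth;
}

// Profile gate: true when the branch weights say the exit edge is taken more
// often than the edge that stays in the loop.
bool exitIsHotter(bool trueIsExit, std::uint64_t trueWeight,
                  std::uint64_t falseWeight) {
  return (trueIsExit && falseWeight < trueWeight) ||
         (!trueIsExit && falseWeight > trueWeight);
}

// Counter setup: the pass adds one to the exit count before passing it to
// the mtctr intrinsic.
std::uint64_t ctrInitialValue(std::uint64_t exitCount) {
  return exitCount + 1;
}
```

For example, a loop with a constant trip count of 3 and 10 instructions on a 4-wide machine is rejected by the first gate, since 10 <= 6 * 4.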

#ifndef NDEBUG		#ifndef NDEBUG
static bool clobbersCTR(const MachineInstr &MI) {		static bool clobbersCTR(const MachineInstr &MI) {
for (unsigned i = 0, e = MI.getNumOperands(); i != e; ++i) {		for (unsigned i = 0, e = MI.getNumOperands(); i != e; ++i) {
const MachineOperand &MO = MI.getOperand(i);		const MachineOperand &MO = MI.getOperand(i);
if (MO.isReg()) {		if (MO.isReg()) {
if (MO.isDef() && (MO.getReg() == PPC::CTR || MO.getReg() == PPC::CTR8))		if (MO.isDef() && (MO.getReg() == PPC::CTR || MO.getReg() == PPC::CTR8))
return true;		return true;
} else if (MO.isRegMask()) {		} else if (MO.isRegMask()) {

lib/Target/PowerPC/PPCISelLowering.cpp


	Results.push_back(RTB);			Results.push_back(RTB);
	Results.push_back(RTB.getValue(1));			Results.push_back(RTB.getValue(1));
	Results.push_back(RTB.getValue(2));			Results.push_back(RTB.getValue(2));
	break;			break;
	}			}
	case ISD::INTRINSIC_W_CHAIN: {			case ISD::INTRINSIC_W_CHAIN: {
	if (cast<ConstantSDNode>(N->getOperand(1))->getZExtValue() !=			if (cast<ConstantSDNode>(N->getOperand(1))->getZExtValue() !=
	Intrinsic::ppc_is_decremented_ctr_nonzero)			Intrinsic::loop_dec)
	break;			break;

	assert(N->getValueType(0) == MVT::i1 &&			//assert(N->getValueType(0) == MVT::i1 &&
	"Unexpected result type for CTR decrement intrinsic");			// "Unexpected result type for CTR decrement intrinsic");
	EVT SVT = getSetCCResultType(DAG.getDataLayout(), *DAG.getContext(),			EVT SVT = getSetCCResultType(DAG.getDataLayout(), *DAG.getContext(),
	N->getValueType(0));			N->getValueType(0));
	SDVTList VTs = DAG.getVTList(SVT, MVT::Other);			SDVTList VTs = DAG.getVTList(SVT, MVT::Other);
	SDValue NewInt = DAG.getNode(N->getOpcode(), dl, VTs, N->getOperand(0),			SDValue NewInt = DAG.getNode(N->getOpcode(), dl, VTs, N->getOperand(0),
	N->getOperand(1));			N->getOperand(1));

	Results.push_back(DAG.getNode(ISD::TRUNCATE, dl, MVT::i1, NewInt));			Results.push_back(DAG.getNode(ISD::TRUNCATE, dl, MVT::i1, NewInt));
	Results.push_back(NewInt.getValue(1));			Results.push_back(NewInt.getValue(1));
	if (FlagUser->getOpcode() == PPCISD::MFOCRF)			if (FlagUser->getOpcode() == PPCISD::MFOCRF)
	return SDValue(VCMPoNode, 0);			return SDValue(VCMPoNode, 0);
	}			}
	break;			break;
	case ISD::BRCOND: {			case ISD::BRCOND: {
	SDValue Cond = N->getOperand(1);			SDValue Cond = N->getOperand(1);
	SDValue Target = N->getOperand(2);			SDValue Target = N->getOperand(2);

	if (Cond.getOpcode() == ISD::INTRINSIC_W_CHAIN &&			if (Cond.getOpcode() == ISD::SETCC &&
	cast<ConstantSDNode>(Cond.getOperand(1))->getZExtValue() ==			Cond.getOperand(0).getOpcode() == ISD::INTRINSIC_W_CHAIN &&
	Intrinsic::ppc_is_decremented_ctr_nonzero) {			cast<ConstantSDNode>(Cond.getOperand(0).getOperand(1))->getZExtValue() ==
				Intrinsic::loop_dec) {

				Cond = Cond.getOperand(0);
	// We now need to make the intrinsic dead (it cannot be instruction			// We now need to make the intrinsic dead (it cannot be instruction
	// selected).			// selected).
	DAG.ReplaceAllUsesOfValueWith(Cond.getValue(1), Cond.getOperand(0));			DAG.ReplaceAllUsesOfValueWith(Cond.getValue(1), Cond.getOperand(0));
	assert(Cond.getNode()->hasOneUse() &&			assert(Cond.getNode()->hasOneUse() &&
	"Counter decrement has more than one use");			"Counter decrement has more than one use");

	return DAG.getNode(PPCISD::BDNZ, dl, MVT::Other,			return DAG.getNode(PPCISD::BDNZ, dl, MVT::Other,
	N->getOperand(0), Target);			N->getOperand(0), Target);
	}			}
	}			}
	break;			break;
	case ISD::BR_CC: {			case ISD::BR_CC: {
	// If this is a branch on an altivec predicate comparison, lower this so			// If this is a branch on an altivec predicate comparison, lower this so
	// that we don't have to do a MFOCRF: instead, branch directly on CR6. This			// that we don't have to do a MFOCRF: instead, branch directly on CR6. This
	// lowering is done pre-legalize, because the legalizer lowers the predicate			// lowering is done pre-legalize, because the legalizer lowers the predicate
	// compare down to code that is difficult to reassemble.			// compare down to code that is difficult to reassemble.
	ISD::CondCode CC = cast<CondCodeSDNode>(N->getOperand(1))->get();			ISD::CondCode CC = cast<CondCodeSDNode>(N->getOperand(1))->get();
	SDValue LHS = N->getOperand(2), RHS = N->getOperand(3);			SDValue LHS = N->getOperand(2), RHS = N->getOperand(3);

	// Sometimes the promoted value of the intrinsic is ANDed by some non-zero			// Sometimes the promoted value of the intrinsic is ANDed by some non-zero
	// value. If so, pass-through the AND to get to the intrinsic.			// value. If so, pass-through the AND to get to the intrinsic.
	if (LHS.getOpcode() == ISD::AND &&			if (LHS.getOpcode() == ISD::AND &&
	LHS.getOperand(0).getOpcode() == ISD::INTRINSIC_W_CHAIN &&			LHS.getOperand(0).getOpcode() == ISD::INTRINSIC_W_CHAIN &&
	cast<ConstantSDNode>(LHS.getOperand(0).getOperand(1))->getZExtValue() ==			cast<ConstantSDNode>(LHS.getOperand(0).getOperand(1))->getZExtValue() ==
	Intrinsic::ppc_is_decremented_ctr_nonzero &&			Intrinsic::loop_dec &&
	isa<ConstantSDNode>(LHS.getOperand(1)) &&			isa<ConstantSDNode>(LHS.getOperand(1)) &&
	!isNullConstant(LHS.getOperand(1)))			!isNullConstant(LHS.getOperand(1)))
	LHS = LHS.getOperand(0);			LHS = LHS.getOperand(0);

	if (LHS.getOpcode() == ISD::INTRINSIC_W_CHAIN &&			if (LHS.getOpcode() == ISD::INTRINSIC_W_CHAIN &&
	cast<ConstantSDNode>(LHS.getOperand(1))->getZExtValue() ==			cast<ConstantSDNode>(LHS.getOperand(1))->getZExtValue() ==
	Intrinsic::ppc_is_decremented_ctr_nonzero &&			Intrinsic::loop_dec &&
	isa<ConstantSDNode>(RHS)) {			isa<ConstantSDNode>(RHS)) {
	assert((CC == ISD::SETEQ || CC == ISD::SETNE) &&			assert((CC == ISD::SETEQ || CC == ISD::SETNE) &&
	"Counter decrement comparison is not EQ or NE");			"Counter decrement comparison is not EQ or NE");

	unsigned Val = cast<ConstantSDNode>(RHS)->getZExtValue();			unsigned Val = cast<ConstantSDNode>(RHS)->getZExtValue();
	bool isBDNZ = (CC == ISD::SETEQ && Val) ||			bool isBDNZ = (CC == ISD::SETEQ && Val) ||
	(CC == ISD::SETNE && !Val);			(CC == ISD::SETNE && !Val);

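
Note on the BR_CC combine above: the `isBDNZ` expression is a two-input truth table over the comparison code and the constant on the right-hand side. A minimal sketch follows, assuming a hypothetical `CondCode` enum and `selectsBDNZ` function in place of `ISD::CondCode` and the inline expression. What happens when the result is false is in the part of the hunk that is not shown, so the sketch does not model it.

```cpp
// Hypothetical stand-in for the two condition codes the assert allows.
enum class CondCode { SETEQ, SETNE };

// The intrinsic yields a nonzero value when the decremented counter is
// nonzero. Comparing it "equal to a nonzero constant" or "not equal to zero"
// therefore both mean: branch while the counter is nonzero.
bool selectsBDNZ(CondCode cc, unsigned val) {
  return (cc == CondCode::SETEQ && val != 0) ||
         (cc == CondCode::SETNE && val == 0);
}
```

So `(SETEQ, 1)` and `(SETNE, 0)` select the branch-if-counter-nonzero form, while `(SETEQ, 0)` and `(SETNE, 1)` do not.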

lib/Target/PowerPC/PPCInstr64Bit.td

def MFCTR8 : XFXForm_1_ext<31, 339, 9, (outs g8rc:$rT), (ins),
PPC970_DGroup_First, PPC970_Unit_FXU;		PPC970_DGroup_First, PPC970_Unit_FXU;
}		}
let Pattern = [(PPCmtctr i64:$rS)], Defs = [CTR8] in {		let Pattern = [(PPCmtctr i64:$rS)], Defs = [CTR8] in {
def MTCTR8 : XFXForm_7_ext<31, 467, 9, (outs), (ins g8rc:$rS),		def MTCTR8 : XFXForm_7_ext<31, 467, 9, (outs), (ins g8rc:$rS),
"mtctr $rS", IIC_SprMTSPR>,		"mtctr $rS", IIC_SprMTSPR>,
PPC970_DGroup_First, PPC970_Unit_FXU;		PPC970_DGroup_First, PPC970_Unit_FXU;
}		}
let hasSideEffects = 1, Defs = [CTR8] in {		let hasSideEffects = 1, Defs = [CTR8] in {
let Pattern = [(int_ppc_mtctr i64:$rS)] in		let Pattern = [(int_set_loop_iterations i64:$rS)] in
def MTCTR8loop : XFXForm_7_ext<31, 467, 9, (outs), (ins g8rc:$rS),		def MTCTR8loop : XFXForm_7_ext<31, 467, 9, (outs), (ins g8rc:$rS),
"mtctr $rS", IIC_SprMTSPR>,		"mtctr $rS", IIC_SprMTSPR>,
PPC970_DGroup_First, PPC970_Unit_FXU;		PPC970_DGroup_First, PPC970_Unit_FXU;
}		}

let Pattern = [(set i64:$rT, readcyclecounter)] in		let Pattern = [(set i64:$rT, readcyclecounter)] in
def MFTB8 : XFXForm_1_ext<31, 339, 268, (outs g8rc:$rT), (ins),		def MFTB8 : XFXForm_1_ext<31, 339, 268, (outs g8rc:$rT), (ins),
"mfspr $rT, 268", IIC_SprMFTB>,		"mfspr $rT, 268", IIC_SprMFTB>,

lib/Target/PowerPC/PPCInstrInfo.td

	PPC970_DGroup_First, PPC970_Unit_FXU;			PPC970_DGroup_First, PPC970_Unit_FXU;
	}			}
	let Defs = [CTR], Pattern = [(PPCmtctr i32:$rS)] in {			let Defs = [CTR], Pattern = [(PPCmtctr i32:$rS)] in {
	def MTCTR : XFXForm_7_ext<31, 467, 9, (outs), (ins gprc:$rS),			def MTCTR : XFXForm_7_ext<31, 467, 9, (outs), (ins gprc:$rS),
	"mtctr $rS", IIC_SprMTSPR>,			"mtctr $rS", IIC_SprMTSPR>,
	PPC970_DGroup_First, PPC970_Unit_FXU;			PPC970_DGroup_First, PPC970_Unit_FXU;
	}			}
	let hasSideEffects = 1, isCodeGenOnly = 1, Defs = [CTR] in {			let hasSideEffects = 1, isCodeGenOnly = 1, Defs = [CTR] in {
	let Pattern = [(int_ppc_mtctr i32:$rS)] in			let Pattern = [(int_set_loop_iterations i32:$rS)] in
	def MTCTRloop : XFXForm_7_ext<31, 467, 9, (outs), (ins gprc:$rS),			def MTCTRloop : XFXForm_7_ext<31, 467, 9, (outs), (ins gprc:$rS),
	"mtctr $rS", IIC_SprMTSPR>,			"mtctr $rS", IIC_SprMTSPR>,
	PPC970_DGroup_First, PPC970_Unit_FXU;			PPC970_DGroup_First, PPC970_Unit_FXU;
	}			}

	let Defs = [LR] in {			let Defs = [LR] in {
	def MTLR : XFXForm_7_ext<31, 467, 8, (outs), (ins gprc:$rS),			def MTLR : XFXForm_7_ext<31, 467, 8, (outs), (ins gprc:$rS),
	"mtlr $rS", IIC_SprMTSPR>,			"mtlr $rS", IIC_SprMTSPR>,

lib/Target/PowerPC/PPCTargetMachine.cpp

ReduceCRLogical("ppc-reduce-cr-logicals",
cl::init(false), cl::Hidden);		cl::init(false), cl::Hidden);
extern "C" void LLVMInitializePowerPCTarget() {		extern "C" void LLVMInitializePowerPCTarget() {
// Register the targets		// Register the targets
RegisterTargetMachine<PPCTargetMachine> A(getThePPC32Target());		RegisterTargetMachine<PPCTargetMachine> A(getThePPC32Target());
RegisterTargetMachine<PPCTargetMachine> B(getThePPC64Target());		RegisterTargetMachine<PPCTargetMachine> B(getThePPC64Target());
RegisterTargetMachine<PPCTargetMachine> C(getThePPC64LETarget());		RegisterTargetMachine<PPCTargetMachine> C(getThePPC64LETarget());

PassRegistry &PR = *PassRegistry::getPassRegistry();		PassRegistry &PR = *PassRegistry::getPassRegistry();
initializePPCCTRLoopsPass(PR);
#ifndef NDEBUG		#ifndef NDEBUG
initializePPCCTRLoopsVerifyPass(PR);		initializePPCCTRLoopsVerifyPass(PR);
#endif		#endif
initializePPCLoopPreIncPrepPass(PR);		initializePPCLoopPreIncPrepPass(PR);
initializePPCTOCRegDepsPass(PR);		initializePPCTOCRegDepsPass(PR);
initializePPCEarlyReturnPass(PR);		initializePPCEarlyReturnPass(PR);
initializePPCVSXCopyPass(PR);		initializePPCVSXCopyPass(PR);
initializePPCVSXFMAMutatePass(PR);		initializePPCVSXFMAMutatePass(PR);
void PPCPassConfig::addIRPasses() {
TargetPassConfig::addIRPasses();		TargetPassConfig::addIRPasses();
}		}

bool PPCPassConfig::addPreISel() {		bool PPCPassConfig::addPreISel() {
if (!DisablePreIncPrep && getOptLevel() != CodeGenOpt::None)		if (!DisablePreIncPrep && getOptLevel() != CodeGenOpt::None)
addPass(createPPCLoopPreIncPrepPass(getPPCTargetMachine()));		addPass(createPPCLoopPreIncPrepPass(getPPCTargetMachine()));

if (!DisableCTRLoops && getOptLevel() != CodeGenOpt::None)		if (!DisableCTRLoops && getOptLevel() != CodeGenOpt::None)
addPass(createPPCCTRLoops());		addPass(createHardwareLoops());

return false;		return false;
}		}

bool PPCPassConfig::addILPOpts() {		bool PPCPassConfig::addILPOpts() {
addPass(&EarlyIfConverterID);		addPass(&EarlyIfConverterID);

if (EnableMachineCombinerPass)		if (EnableMachineCombinerPass)

lib/Target/PowerPC/PPCTargetTransformInfo.h

public:

int getIntImmCost(unsigned Opcode, unsigned Idx, const APInt &Imm, Type *Ty);		int getIntImmCost(unsigned Opcode, unsigned Idx, const APInt &Imm, Type *Ty);
int getIntImmCost(Intrinsic::ID IID, unsigned Idx, const APInt &Imm,		int getIntImmCost(Intrinsic::ID IID, unsigned Idx, const APInt &Imm,
Type *Ty);		Type *Ty);

unsigned getUserCost(const User *U, ArrayRef<const Value *> Operands);		unsigned getUserCost(const User *U, ArrayRef<const Value *> Operands);

TTI::PopcntSupportKind getPopcntSupport(unsigned TyWidth);		TTI::PopcntSupportKind getPopcntSupport(unsigned TyWidth);
		bool mightUseCTR(BasicBlock *BB, TargetLibraryInfo *LibInfo);
		bool isHardwareLoopProfitable(Loop *L, ScalarEvolution &SE,
		AssumptionCache &AC,
		TargetLibraryInfo *LibInfo,
		TTI::HardwareLoopInfo &HWLoopInfo);
void getUnrollingPreferences(Loop *L, ScalarEvolution &SE,		void getUnrollingPreferences(Loop *L, ScalarEvolution &SE,
TTI::UnrollingPreferences &UP);		TTI::UnrollingPreferences &UP);

/// @}		/// @}

/// \name Vector TTI Implementations		/// \name Vector TTI Implementations
/// @{		/// @{
bool useColdCCForColdCall(Function &F);		bool useColdCCForColdCall(Function &F);

lib/Target/PowerPC/PPCTargetTransformInfo.cpp

//===-- PPCTargetTransformInfo.cpp - PPC specific TTI ---------------------===//		//===-- PPCTargetTransformInfo.cpp - PPC specific TTI ---------------------===//
//		//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.		// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.		// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception		// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

#include "PPCTargetTransformInfo.h"		#include "PPCTargetTransformInfo.h"
		#include "llvm/Analysis/CodeMetrics.h"
#include "llvm/Analysis/TargetTransformInfo.h"		#include "llvm/Analysis/TargetTransformInfo.h"
#include "llvm/CodeGen/BasicTTIImpl.h"		#include "llvm/CodeGen/BasicTTIImpl.h"
#include "llvm/CodeGen/CostTable.h"		#include "llvm/CodeGen/CostTable.h"
#include "llvm/CodeGen/TargetLowering.h"		#include "llvm/CodeGen/TargetLowering.h"
		#include "llvm/CodeGen/TargetSchedule.h"
#include "llvm/Support/CommandLine.h"		#include "llvm/Support/CommandLine.h"
#include "llvm/Support/Debug.h"		#include "llvm/Support/Debug.h"
using namespace llvm;		using namespace llvm;

#define DEBUG_TYPE "ppctti"		#define DEBUG_TYPE "ppctti"

static cl::opt<bool> DisablePPCConstHoist("disable-ppc-constant-hoisting",		static cl::opt<bool> DisablePPCConstHoist("disable-ppc-constant-hoisting",
cl::desc("disable constant hoisting on PPC"), cl::init(false), cl::Hidden);		cl::desc("disable constant hoisting on PPC"), cl::init(false), cl::Hidden);

// This is currently only used for the data prefetch pass which is only enabled		// This is currently only used for the data prefetch pass which is only enabled
// for BG/Q by default.		// for BG/Q by default.
static cl::opt<unsigned>		static cl::opt<unsigned>
CacheLineSize("ppc-loop-prefetch-cache-line", cl::Hidden, cl::init(64),		CacheLineSize("ppc-loop-prefetch-cache-line", cl::Hidden, cl::init(64),
cl::desc("The loop prefetch cache line size"));		cl::desc("The loop prefetch cache line size"));

static cl::opt<bool>		static cl::opt<bool>
EnablePPCColdCC("ppc-enable-coldcc", cl::Hidden, cl::init(false),		EnablePPCColdCC("ppc-enable-coldcc", cl::Hidden, cl::init(false),
cl::desc("Enable using coldcc calling conv for cold "		cl::desc("Enable using coldcc calling conv for cold "
"internal functions"));		"internal functions"));

		// The latency of mtctr is only justified if there are more than 4
		// comparisons that will be removed as a result.
		static cl::opt<unsigned>
		SmallCTRLoopThreshold("min-ctr-loop-threshold", cl::init(4), cl::Hidden,
		cl::desc("Loops with a constant trip count smaller than "
		"this value will not use the count register."));

//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
//		//
// PPC cost model.		// PPC cost model.
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

TargetTransformInfo::PopcntSupportKind		TargetTransformInfo::PopcntSupportKind
PPCTTIImpl::getPopcntSupport(unsigned TyWidth) {		PPCTTIImpl::getPopcntSupport(unsigned TyWidth) {
if (U->getType()->isVectorTy()) {
// Instructions that need to be split should cost more.		// Instructions that need to be split should cost more.
std::pair<int, MVT> LT = TLI->getTypeLegalizationCost(DL, U->getType());		std::pair<int, MVT> LT = TLI->getTypeLegalizationCost(DL, U->getType());
return LT.first * BaseT::getUserCost(U, Operands);		return LT.first * BaseT::getUserCost(U, Operands);
}		}

return BaseT::getUserCost(U, Operands);		return BaseT::getUserCost(U, Operands);
}		}

		bool PPCTTIImpl::mightUseCTR(BasicBlock *BB,
		TargetLibraryInfo *LibInfo) {
		const PPCTargetMachine &TM = ST->getTargetMachine();

		// Loop through the inline asm constraints and look for something that
		// clobbers ctr.
		auto asmClobbersCTR = [](InlineAsm *IA) {
		InlineAsm::ConstraintInfoVector CIV = IA->ParseConstraints();
		for (unsigned i = 0, ie = CIV.size(); i < ie; ++i) {
		InlineAsm::ConstraintInfo &C = CIV[i];
		if (C.Type != InlineAsm::isInput)
		for (unsigned j = 0, je = C.Codes.size(); j < je; ++j)
		if (StringRef(C.Codes[j]).equals_lower("{ctr}"))
		return true;
		}
		return false;
		};

		// Determining the address of a TLS variable results in a function call in
		// certain TLS models.
		std::function<bool(const Value*)> memAddrUsesCTR =
		[&memAddrUsesCTR, &TM](const Value *MemAddr) -> bool {
		const auto *GV = dyn_cast<GlobalValue>(MemAddr);
		if (!GV) {
		// Recurse to check for constants that refer to TLS global variables.
		if (const auto *CV = dyn_cast<Constant>(MemAddr))
		for (const auto &CO : CV->operands())
		if (memAddrUsesCTR(CO))
		return true;

		return false;
		}

		if (!GV->isThreadLocal())
		return false;
		TLSModel::Model Model = TM.getTLSModel(GV);
		return Model == TLSModel::GeneralDynamic ||
		Model == TLSModel::LocalDynamic;
		};

		auto isLargeIntegerTy = [](bool Is32Bit, Type *Ty) {
		if (IntegerType *ITy = dyn_cast<IntegerType>(Ty))
		return ITy->getBitWidth() > (Is32Bit ? 32U : 64U);

		return false;
		};

		for (BasicBlock::iterator J = BB->begin(), JE = BB->end();
		J != JE; ++J) {
		if (CallInst *CI = dyn_cast<CallInst>(J)) {
		// Inline ASM is okay, unless it clobbers the ctr register.
		if (InlineAsm *IA = dyn_cast<InlineAsm>(CI->getCalledValue())) {
		if (asmClobbersCTR(IA))
		return true;
		continue;
		}

		if (Function *F = CI->getCalledFunction()) {
		// Most intrinsics don't become function calls, but some might.
		// sin, cos, exp and log are always calls.
		unsigned Opcode = 0;
		if (F->getIntrinsicID() != Intrinsic::not_intrinsic) {
		switch (F->getIntrinsicID()) {
		default: continue;
		// If we have a call to loop_dec or set_loop_iterations,
		// we're definitely using CTR.
		case Intrinsic::set_loop_iterations:
		case Intrinsic::loop_dec:
		return true;

		// VisualStudio defines setjmp as _setjmp
		#if defined(_MSC_VER) && defined(setjmp) && \
		!defined(setjmp_undefined_for_msvc)
		# pragma push_macro("setjmp")
		# undef setjmp
		# define setjmp_undefined_for_msvc
		#endif

		case Intrinsic::setjmp:

		#if defined(_MSC_VER) && defined(setjmp_undefined_for_msvc)
		// let's return it to _setjmp state
		# pragma pop_macro("setjmp")
		# undef setjmp_undefined_for_msvc
		#endif

		case Intrinsic::longjmp:

		// Exclude eh_sjlj_setjmp; we don't need to exclude eh_sjlj_longjmp
		// because, although it does clobber the counter register, the
		// control can't then return to inside the loop unless there is also
		// an eh_sjlj_setjmp.
		case Intrinsic::eh_sjlj_setjmp:

		case Intrinsic::memcpy:
		case Intrinsic::memmove:
		case Intrinsic::memset:
		case Intrinsic::powi:
		case Intrinsic::log:
		case Intrinsic::log2:
		case Intrinsic::log10:
		case Intrinsic::exp:
		case Intrinsic::exp2:
		case Intrinsic::pow:
		case Intrinsic::sin:
		case Intrinsic::cos:
		return true;
		case Intrinsic::copysign:
		if (CI->getArgOperand(0)->getType()->getScalarType()->
		isPPC_FP128Ty())
		return true;
		else
		continue; // ISD::FCOPYSIGN is never a library call.
		case Intrinsic::sqrt: Opcode = ISD::FSQRT; break;
		case Intrinsic::floor: Opcode = ISD::FFLOOR; break;
		case Intrinsic::ceil: Opcode = ISD::FCEIL; break;
		case Intrinsic::trunc: Opcode = ISD::FTRUNC; break;
		case Intrinsic::rint: Opcode = ISD::FRINT; break;
		case Intrinsic::nearbyint: Opcode = ISD::FNEARBYINT; break;
		case Intrinsic::round: Opcode = ISD::FROUND; break;
		case Intrinsic::minnum: Opcode = ISD::FMINNUM; break;
		case Intrinsic::maxnum: Opcode = ISD::FMAXNUM; break;
		case Intrinsic::umul_with_overflow: Opcode = ISD::UMULO; break;
		case Intrinsic::smul_with_overflow: Opcode = ISD::SMULO; break;
		}
		}

		// PowerPC does not use [US]DIVREM or other library calls for
		// operations on regular types which are not otherwise library calls
		// (i.e. soft float or atomics). If adapting for targets that do,
		// additional care is required here.

		LibFunc Func;
		if (!F->hasLocalLinkage() && F->hasName() && LibInfo &&
		LibInfo->getLibFunc(F->getName(), Func) &&
		LibInfo->hasOptimizedCodeGen(Func)) {
		// Non-read-only functions are never treated as intrinsics.
		if (!CI->onlyReadsMemory())
		return true;

		// Conversion happens only for FP calls.
		if (!CI->getArgOperand(0)->getType()->isFloatingPointTy())
		return true;

		switch (Func) {
		default: return true;
		case LibFunc_copysign:
		case LibFunc_copysignf:
		continue; // ISD::FCOPYSIGN is never a library call.
		case LibFunc_copysignl:
		return true;
		case LibFunc_fabs:
		case LibFunc_fabsf:
		case LibFunc_fabsl:
		continue; // ISD::FABS is never a library call.
		case LibFunc_sqrt:
		case LibFunc_sqrtf:
		case LibFunc_sqrtl:
		Opcode = ISD::FSQRT; break;
		case LibFunc_floor:
		case LibFunc_floorf:
		case LibFunc_floorl:
		Opcode = ISD::FFLOOR; break;
		case LibFunc_nearbyint:
		case LibFunc_nearbyintf:
		case LibFunc_nearbyintl:
		Opcode = ISD::FNEARBYINT; break;
		case LibFunc_ceil:
		case LibFunc_ceilf:
		case LibFunc_ceill:
		Opcode = ISD::FCEIL; break;
		case LibFunc_rint:
		case LibFunc_rintf:
		case LibFunc_rintl:
		Opcode = ISD::FRINT; break;
		case LibFunc_round:
		case LibFunc_roundf:
		case LibFunc_roundl:
		Opcode = ISD::FROUND; break;
		case LibFunc_trunc:
		case LibFunc_truncf:
		case LibFunc_truncl:
		Opcode = ISD::FTRUNC; break;
		case LibFunc_fmin:
		case LibFunc_fminf:
		case LibFunc_fminl:
		Opcode = ISD::FMINNUM; break;
		case LibFunc_fmax:
		case LibFunc_fmaxf:
		case LibFunc_fmaxl:
		Opcode = ISD::FMAXNUM; break;
		}
		}

		if (Opcode) {
		EVT EVTy =
		TLI->getValueType(DL, CI->getArgOperand(0)->getType(), true);

		if (EVTy == MVT::Other)
		return true;

		if (TLI->isOperationLegalOrCustom(Opcode, EVTy))
		continue;
		else if (EVTy.isVector() &&
		TLI->isOperationLegalOrCustom(Opcode, EVTy.getScalarType()))
		continue;

		return true;
		}
		}

		return true;
		} else if (isa<BinaryOperator>(J) &&
		J->getType()->getScalarType()->isPPC_FP128Ty()) {
		// Most operations on ppc_f128 values become calls.
		return true;
		} else if (isa<UIToFPInst>(J) || isa<SIToFPInst>(J) ||
		isa<FPToUIInst>(J) || isa<FPToSIInst>(J)) {
		CastInst *CI = cast<CastInst>(J);
		if (CI->getSrcTy()->getScalarType()->isPPC_FP128Ty() ||
		CI->getDestTy()->getScalarType()->isPPC_FP128Ty() ||
		isLargeIntegerTy(!TM.isPPC64(), CI->getSrcTy()->getScalarType()) ||
		isLargeIntegerTy(!TM.isPPC64(), CI->getDestTy()->getScalarType()))
		return true;
		} else if (isLargeIntegerTy(!TM.isPPC64(),
		J->getType()->getScalarType()) &&
		(J->getOpcode() == Instruction::UDiv ||
		J->getOpcode() == Instruction::SDiv ||
		J->getOpcode() == Instruction::URem ||
		J->getOpcode() == Instruction::SRem)) {
		return true;
		} else if (!TM.isPPC64() &&
		isLargeIntegerTy(false, J->getType()->getScalarType()) &&
		(J->getOpcode() == Instruction::Shl ||
		J->getOpcode() == Instruction::AShr ||
		J->getOpcode() == Instruction::LShr)) {
		// Only on PPC32, for 128-bit integers (specifically not 64-bit
		// integers), these might be runtime calls.
		return true;
		} else if (isa<IndirectBrInst>(J) || isa<InvokeInst>(J)) {
		// On PowerPC, indirect jumps use the counter register.
		return true;
		} else if (SwitchInst *SI = dyn_cast<SwitchInst>(J)) {
		if (SI->getNumCases() + 1 >= (unsigned)TLI->getMinimumJumpTableEntries())
		return true;
		}

		// FREM is always a call.
		if (J->getOpcode() == Instruction::FRem)
		return true;

		if (ST->useSoftFloat()) {
		switch(J->getOpcode()) {
		case Instruction::FAdd:
		case Instruction::FSub:
		case Instruction::FMul:
		case Instruction::FDiv:
		case Instruction::FPTrunc:
		case Instruction::FPExt:
		case Instruction::FPToUI:
		case Instruction::FPToSI:
		case Instruction::UIToFP:
		case Instruction::SIToFP:
		case Instruction::FCmp:
		return true;
		}
		}

		for (Value *Operand : J->operands())
		if (memAddrUsesCTR(Operand))
		return true;
		}

		return false;
		}

		bool PPCTTIImpl::isHardwareLoopProfitable(Loop *L, ScalarEvolution &SE,
		AssumptionCache &AC,
		TargetLibraryInfo *LibInfo,
		TTI::HardwareLoopInfo &HWLoopInfo) {
		const PPCTargetMachine &TM = ST->getTargetMachine();
		TargetSchedModel SchedModel;
		SchedModel.init(ST);

		// Do not convert small short loops to CTR loop.
		unsigned ConstTripCount = SE.getSmallConstantTripCount(L);
		if (ConstTripCount && ConstTripCount < SmallCTRLoopThreshold) {
		SmallPtrSet<const Value *, 32> EphValues;
		CodeMetrics::collectEphemeralValues(L, &AC, EphValues);
		CodeMetrics Metrics;
		for (BasicBlock *BB : L->blocks())
		Metrics.analyzeBasicBlock(BB, *this, EphValues);
		// 6 is an approximate latency for the mtctr instruction.
		if (Metrics.NumInsts <= (6 * SchedModel.getIssueWidth()))
		return false;
		}

		// We don't want to spill/restore the counter register, and so we don't
		// want to use the counter register if the loop contains calls.
		for (Loop::block_iterator I = L->block_begin(), IE = L->block_end();
		I != IE; ++I)
		if (mightUseCTR(*I, LibInfo))
		return false;

		SmallVector<BasicBlock*, 4> ExitingBlocks;
		L->getExitingBlocks(ExitingBlocks);

		// If there is an exit edge known to be frequently taken,
		// we should not transform this loop.
		for (auto &BB : ExitingBlocks) {
		Instruction *TI = BB->getTerminator();
		if (!TI) continue;

		if (BranchInst *BI = dyn_cast<BranchInst>(TI)) {
		uint64_t TrueWeight = 0, FalseWeight = 0;
		if (!BI->isConditional() ||
		!BI->extractProfMetadata(TrueWeight, FalseWeight))
		continue;

		// If the exit path is more frequent than the loop path,
		// we return here without further analysis for this loop.
		bool TrueIsExit = !L->contains(BI->getSuccessor(0));
		if (( TrueIsExit && FalseWeight < TrueWeight) ||
		(!TrueIsExit && FalseWeight > TrueWeight))
		return false;
		}
		}

		LLVMContext &C = L->getHeader()->getParent()->getParent()->getContext();
		HWLoopInfo.CountType = TM.isPPC64() ?
		Type::getInt64Ty(C) : Type::getInt32Ty(C);

		return true;
		}
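
The two cost checks in isHardwareLoopProfitable above can be restated outside LLVM as a small standalone sketch. This is only an illustration under stated assumptions: `LoopSummary`, `isTooSmallForCTR`, `exitIsHot` and the `SmallTripThreshold` parameter are hypothetical stand-ins for the SCEV trip count, `CodeMetrics::NumInsts`, the scheduling model's issue width, `SmallCTRLoopThreshold` and the branch-weight metadata. The constant 6 is the approximate mtctr latency quoted in the source comment.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, LLVM-free summary of the facts the heuristic consults.
struct LoopSummary {
  unsigned ConstTripCount; // 0 means "not a known constant"
  unsigned NumInsts;       // instruction count of the loop body
  unsigned IssueWidth;     // instructions issued per cycle
};

// Mirrors the "small short loop" test: a loop with a known, small trip count
// whose body fits in roughly the latency of mtctr is not worth converting.
bool isTooSmallForCTR(const LoopSummary &L, unsigned SmallTripThreshold) {
  const unsigned MtctrLatency = 6; // approximate, taken from the comment above
  if (L.ConstTripCount == 0 || L.ConstTripCount >= SmallTripThreshold)
    return false;
  return L.NumInsts <= MtctrLatency * L.IssueWidth;
}

// Mirrors the exit-edge test: returns true when the exit successor carries
// more profile weight than the successor that stays in the loop.
bool exitIsHot(bool TrueIsExit, uint64_t TrueWeight, uint64_t FalseWeight) {
  return (TrueIsExit && FalseWeight < TrueWeight) ||
         (!TrueIsExit && FalseWeight > TrueWeight);
}
```

In this sketch a loop is rejected when either helper returns true; equal branch weights do not count as a hot exit, which matches the strict comparisons in the patch.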

void PPCTTIImpl::getUnrollingPreferences(Loop *L, ScalarEvolution &SE,		void PPCTTIImpl::getUnrollingPreferences(Loop *L, ScalarEvolution &SE,
TTI::UnrollingPreferences &UP) {		TTI::UnrollingPreferences &UP) {
if (ST->getDarwinDirective() == PPC::DIR_A2) {		if (ST->getDarwinDirective() == PPC::DIR_A2) {
// The A2 is in-order with a deep pipeline, and concatenation unrolling		// The A2 is in-order with a deep pipeline, and concatenation unrolling
// helps expose latency-hiding opportunities to the instruction scheduler.		// helps expose latency-hiding opportunities to the instruction scheduler.
UP.Partial = UP.Runtime = true;		UP.Partial = UP.Runtime = true;

// We unroll a lot on the A2 (hundreds of instructions), and the benefits		// We unroll a lot on the A2 (hundreds of instructions), and the benefits

test/CodeGen/PowerPC/ctrloop-intrin.ll

for.cond.112.preheader: ; preds = %if.end.138, %if.end.105
%int_part_ptr.02534 = ptrtoint i8* %int_part_ptr.0253 to i64		%int_part_ptr.02534 = ptrtoint i8* %int_part_ptr.0253 to i64
%cmp114.249 = icmp eq i8* %call109, %int_part_ptr.0253		%cmp114.249 = icmp eq i8* %call109, %int_part_ptr.0253
br i1 %cmp114.249, label %if.end.138, label %for.body.116.preheader		br i1 %cmp114.249, label %if.end.138, label %for.body.116.preheader

for.body.116.preheader: ; preds = %for.cond.112.preheader		for.body.116.preheader: ; preds = %for.cond.112.preheader
%8 = sub i64 0, %int_part_ptr.02534		%8 = sub i64 0, %int_part_ptr.02534
%scevgep5 = getelementptr i8, i8* %call109, i64 %8		%scevgep5 = getelementptr i8, i8* %call109, i64 %8
%scevgep56 = ptrtoint i8* %scevgep5 to i64		%scevgep56 = ptrtoint i8* %scevgep5 to i64
call void @llvm.ppc.mtctr.i64(i64 %scevgep56)		call void @llvm.set.loop.iterations.i64(i64 %scevgep56)
br label %for.body.116		br label %for.body.116

for.cond.cleanup: ; preds = %if.end.138, %if.end.105		for.cond.cleanup: ; preds = %if.end.138, %if.end.105
%int_part_ptr.0.lcssa = phi i8* [ %add.ptr106, %if.end.105 ], [ %int_part_ptr.1, %if.end.138 ]		%int_part_ptr.0.lcssa = phi i8* [ %add.ptr106, %if.end.105 ], [ %int_part_ptr.1, %if.end.138 ]
%9 = bitcast [512 x i8]* %buf to i8*		%9 = bitcast [512 x i8]* %buf to i8*
%call142 = call i8* @halide_string_to_string(i8* %dst.addr.0, i8* %end, i8* %int_part_ptr.0.lcssa) #3		%call142 = call i8* @halide_string_to_string(i8* %dst.addr.0, i8* %end, i8* %int_part_ptr.0.lcssa) #3
%call143 = call i8* @halide_string_to_string(i8* %call142, i8* %end, i8* getelementptr inbounds ([2 x i8], [2 x i8]* @.str.9.96, i64 0, i64 0)) #3		%call143 = call i8* @halide_string_to_string(i8* %call142, i8* %end, i8* getelementptr inbounds ([2 x i8], [2 x i8]* @.str.9.96, i64 0, i64 0)) #3
%call144 = call i8* @halide_int64_to_string(i8* %call143, i8* %end, i64 %fractional_part.2, i32 6) #3		%call144 = call i8* @halide_int64_to_string(i8* %call143, i8* %end, i64 %fractional_part.2, i32 6) #3
for.body.116: ; preds = %for.body.116, %for.body.116.preheader
%cmp125 = icmp sgt i8 %11, 9		%cmp125 = icmp sgt i8 %11, 9
%sub128 = add nsw i32 %add122, 246		%sub128 = add nsw i32 %add122, 246
%carry.1 = zext i1 %cmp125 to i32		%carry.1 = zext i1 %cmp125 to i32
%new_digit.0.in = select i1 %cmp125, i32 %sub128, i32 %add122		%new_digit.0.in = select i1 %cmp125, i32 %sub128, i32 %add122
%add133 = add nsw i32 %new_digit.0.in, 48		%add133 = add nsw i32 %new_digit.0.in, 48
%conv134 = trunc i32 %add133 to i8		%conv134 = trunc i32 %add133 to i8
%scevgep = getelementptr i8, i8* inttoptr (i64 -1 to i8*), i64 %call109.pn2		%scevgep = getelementptr i8, i8* inttoptr (i64 -1 to i8*), i64 %call109.pn2
store i8 %conv134, i8* %scevgep, align 1, !tbaa !10		store i8 %conv134, i8* %scevgep, align 1, !tbaa !10
%12 = call i1 @llvm.ppc.is.decremented.ctr.nonzero()		%12 = call i64 @llvm.loop.dec(i64 %scevgep56, i64 1)
br i1 %12, label %for.body.116, label %for.cond.cleanup.115		%dec.cmp = icmp ne i64 %12, 0
		br i1 %dec.cmp, label %for.body.116, label %for.cond.cleanup.115

if.then.136: ; preds = %for.cond.cleanup.115		if.then.136: ; preds = %for.cond.cleanup.115
%incdec.ptr137 = getelementptr inbounds i8, i8* %int_part_ptr.0253, i64 -1		%incdec.ptr137 = getelementptr inbounds i8, i8* %int_part_ptr.0253, i64 -1
store i8 49, i8* %incdec.ptr137, align 1, !tbaa !10		store i8 49, i8* %incdec.ptr137, align 1, !tbaa !10
br label %if.end.138		br label %if.end.138

if.end.138: ; preds = %if.then.136, %for.cond.cleanup.115, %for.cond.112.preheader		if.end.138: ; preds = %if.then.136, %for.cond.cleanup.115, %for.cond.112.preheader
%int_part_ptr.1 = phi i8* [ %incdec.ptr137, %if.then.136 ], [ %call109, %for.cond.112.preheader ], [ %int_part_ptr.0253, %for.cond.cleanup.115 ]		%int_part_ptr.1 = phi i8* [ %incdec.ptr137, %if.then.136 ], [ %call109, %for.cond.112.preheader ], [ %int_part_ptr.0253, %for.cond.cleanup.115 ]
%inc140 = add nuw nsw i32 %i.0255, 1		%inc140 = add nuw nsw i32 %i.0255, 1
%exitcond = icmp eq i32 %inc140, %integer_exponent.0		%exitcond = icmp eq i32 %inc140, %integer_exponent.0
br i1 %exitcond, label %for.cond.cleanup, label %for.cond.112.preheader		br i1 %exitcond, label %for.cond.cleanup, label %for.cond.112.preheader

cleanup.148: ; preds = %for.cond.cleanup, %if.then.64, %if.end.59, %if.else.30, %if.then.28, %if.else.24, %if.then.22, %if.else.13, %if.then.11, %if.else, %if.then.6		cleanup.148: ; preds = %for.cond.cleanup, %if.then.64, %if.end.59, %if.else.30, %if.then.28, %if.else.24, %if.then.22, %if.else.13, %if.then.11, %if.else, %if.then.6
%retval.1 = phi i8* [ %call7, %if.then.6 ], [ %call8, %if.else ], [ %call12, %if.then.11 ], [ %call14, %if.else.13 ], [ %call23, %if.then.22 ], [ %call25, %if.else.24 ], [ %call29, %if.then.28 ], [ %call31, %if.else.30 ], [ %call65, %if.then.64 ], [ %call61, %if.end.59 ], [ %call144, %for.cond.cleanup ]		%retval.1 = phi i8* [ %call7, %if.then.6 ], [ %call8, %if.else ], [ %call12, %if.then.11 ], [ %call14, %if.else.13 ], [ %call23, %if.then.22 ], [ %call25, %if.else.24 ], [ %call29, %if.then.28 ], [ %call31, %if.else.30 ], [ %call65, %if.then.64 ], [ %call61, %if.end.59 ], [ %call144, %for.cond.cleanup ]
%13 = bitcast i64* %bits to i8*		%13 = bitcast i64* %bits to i8*
call void @llvm.lifetime.end.p0i8(i64 8, i8* %13) #0		call void @llvm.lifetime.end.p0i8(i64 8, i8* %13) #0
ret i8* %retval.1		ret i8* %retval.1
}		}

; Function Attrs: nounwind		; Function Attrs: nounwind
declare i8* @memcpy(i8*, i8* nocapture readonly, i64) #1		declare i8* @memcpy(i8*, i8* nocapture readonly, i64) #1

; Function Attrs: nounwind		; Function Attrs: nounwind
declare void @llvm.ppc.mtctr.i64(i64) #0		declare void @llvm.set.loop.iterations.i64(i64) #0

; Function Attrs: nounwind		; Function Attrs: nounwind
declare i1 @llvm.ppc.is.decremented.ctr.nonzero() #0		declare i64 @llvm.loop.dec(i64, i64) #0

attributes #0 = { nounwind }		attributes #0 = { nounwind }
attributes #1 = { nounwind "disable-tail-calls"="false" "less-precise-fpmad"="false" "no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf" "no-infs-fp-math"="false" "no-nans-fp-math"="false" "stack-protector-buffer-size"="8" "unsafe-fp-math"="false" "use-soft-float"="false" }		attributes #1 = { nounwind "disable-tail-calls"="false" "less-precise-fpmad"="false" "no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf" "no-infs-fp-math"="false" "no-nans-fp-math"="false" "stack-protector-buffer-size"="8" "unsafe-fp-math"="false" "use-soft-float"="false" }
attributes #2 = { nounwind }		attributes #2 = { nounwind }
attributes #3 = { nounwind }		attributes #3 = { nounwind }

!llvm.ident = !{!0, !0, !0, !0, !0, !0, !0, !0, !0, !0, !0, !0, !0, !0, !0, !0, !0, !0, !0, !0}		!llvm.ident = !{!0, !0, !0, !0, !0, !0, !0, !0, !0, !0, !0, !0, !0, !0, !0, !0, !0, !0, !0, !0}
!llvm.module.flags = !{!1, !2, !3}		!llvm.module.flags = !{!1, !2, !3}
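
The rewritten test above replaces the PowerPC-specific pair (`llvm.ppc.mtctr` and `llvm.ppc.is.decremented.ctr.nonzero`) with `llvm.set.loop.iterations`, followed by `llvm.loop.dec` and an explicit `icmp ne ..., 0`. A minimal sketch of that control-flow shape in plain C++, assuming the counter behaves like an ordinary unsigned register; `runCountedLoop` is a hypothetical name used for illustration and is not part of the patch.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Models a do-while style counted loop: the counter is set once in the
// preheader, the body runs, then the counter is decremented and tested.
// Precondition: Iterations > 0, since the IR only reaches the preheader
// after a guarding branch has ruled out the empty case.
uint64_t runCountedLoop(uint64_t Iterations,
                        const std::function<void()> &Body) {
  assert(Iterations > 0 && "preheader is only entered for a non-zero count");
  uint64_t Counter = Iterations; // llvm.set.loop.iterations
  uint64_t Executed = 0;
  bool Continue;
  do {
    Body();
    ++Executed;
    Counter -= 1;              // llvm.loop.dec(counter, 1)
    Continue = (Counter != 0); // icmp ne i64 ..., 0
  } while (Continue);
  return Executed;
}
```

The point of the split is that the decrement now yields a value and the comparison is ordinary IR, so the branch condition is visible to generic passes instead of being hidden inside a target intrinsic that returns `i1`.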

test/CodeGen/PowerPC/ppc-passname.ll

	; Test pass name: ppc-ctr-loops.
	; RUN: llc -mtriple=powerpc64le-unknown-unknown < %s -debug-pass=Structure -stop-before=ppc-ctr-loops -o /dev/null 2>&1 | FileCheck %s -check-prefix=STOP-BEFORE-CTR-LOOPS
	; STOP-BEFORE-CTR-LOOPS-NOT: -ppc-ctr-loops
	; STOP-BEFORE-CTR-LOOPS-NOT: "ppc-ctr-loops" pass is not registered.
	; STOP-BEFORE-CTR-LOOPS-NOT: PowerPC CTR Loops

	; RUN: llc -mtriple=powerpc64le-unknown-unknown < %s -debug-pass=Structure -stop-after=ppc-ctr-loops -o /dev/null 2>&1 | FileCheck %s -check-prefix=STOP-AFTER-CTR-LOOPS
	; STOP-AFTER-CTR-LOOPS: -ppc-ctr-loops
	; STOP-AFTER-CTR-LOOPS-NOT: "ppc-ctr-loops" pass is not registered.
	; STOP-AFTER-CTR-LOOPS: PowerPC CTR Loops


	; Test pass name: ppc-loop-preinc-prep.			; Test pass name: ppc-loop-preinc-prep.
	; RUN: llc -mtriple=powerpc64le-unknown-unknown < %s -debug-pass=Structure -stop-before=ppc-loop-preinc-prep -o /dev/null 2>&1 | FileCheck %s -check-prefix=STOP-BEFORE-LOOP-PREINC-PREP			; RUN: llc -mtriple=powerpc64le-unknown-unknown < %s -debug-pass=Structure -stop-before=ppc-loop-preinc-prep -o /dev/null 2>&1 | FileCheck %s -check-prefix=STOP-BEFORE-LOOP-PREINC-PREP
	; STOP-BEFORE-LOOP-PREINC-PREP-NOT: -ppc-loop-preinc-prep			; STOP-BEFORE-LOOP-PREINC-PREP-NOT: -ppc-loop-preinc-prep
	; STOP-BEFORE-LOOP-PREINC-PREP-NOT: "ppc-loop-preinc-prep" pass is not registered.			; STOP-BEFORE-LOOP-PREINC-PREP-NOT: "ppc-loop-preinc-prep" pass is not registered.
	; STOP-BEFORE-LOOP-PREINC-PREP-NOT: Prepare loop for pre-inc. addressing modes			; STOP-BEFORE-LOOP-PREINC-PREP-NOT: Prepare loop for pre-inc. addressing modes

	; RUN: llc -mtriple=powerpc64le-unknown-unknown < %s -debug-pass=Structure -stop-after=ppc-loop-preinc-prep -o /dev/null 2>&1 | FileCheck %s -check-prefix=STOP-AFTER-LOOP-PREINC-PREP			; RUN: llc -mtriple=powerpc64le-unknown-unknown < %s -debug-pass=Structure -stop-after=ppc-loop-preinc-prep -o /dev/null 2>&1 | FileCheck %s -check-prefix=STOP-AFTER-LOOP-PREINC-PREP
	; STOP-AFTER-LOOP-PREINC-PREP: -ppc-loop-preinc-prep			; STOP-AFTER-LOOP-PREINC-PREP: -ppc-loop-preinc-prep

test/CodeGen/Thumb2/mve-tailpred.ll

This file was added.

				; RUN: opt -mtriple=thumbv8 -mcpu=cortex-a72 %s -arm-hardware-loops -dce -S -o - | FileCheck %s --check-prefix=CHECK-OPT
				; RUN: llc -mtriple=thumbv8 -mcpu=cortex-a72 %s -S -o - | FileCheck %s --check-prefix=CHECK-LLC

				; CHECK-OPT-LABEL: mul_N
				; CHECK-OPT: %0 = call i32 @llvm.arm.while.setup(i32 %N, i32 4)
				; CHECK-OPT: br i1 %1, label %vector.ph, label %for.cond.cleanup

				; CHECK-OPT: vector.ph:
				; CHECK-OPT: br label %vector.body

				; CHECK-OPT: vector.body:
				; CHECK-OPT: %index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
				; CHECK-OPT: %2 = phi i32 [ %N, %vector.ph ], [ %11, %vector.body ]
				; CHECK-OPT: %3 = getelementptr inbounds i32, i32* %a, i32 %index
				; CHECK-OPT: %4 = call <4 x i1> @llvm.arm.get.active.mask.4(i32 %2
				; CHECK-OPT: %wide.masked.load = tail call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %5, i32 4, <4 x i1> %4, <4 x i32> undef)
				; CHECK-OPT: %wide.masked.load12 = tail call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %7, i32 4, <4 x i1> %4, <4 x i32> undef)
				; CHECK-OPT: %8 = mul nsw <4 x i32> %wide.masked.load12, %wide.masked.load
				; CHECK-OPT: tail call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %8, <4 x i32>* %10, i32 4, <4 x i1> %4)
				; CHECK-OPT: %index.next = add i32 %index, 4
				; CHECK-OPT: %11 = call i32 @llvm.arm.loop.end(i32 %2, i32 4)
				; CHECK-OPT: %12 = icmp ne i32 %11, 0
				; CHECK-OPT: br i1 %12, label %vector.body, label %for.cond.cleanup

				; CHECK-LLC-LABEL: mul_N
				; CHECK-LLC: wlstp.#4 lr, r3, .LBB0_3
				; CHECK-LLC: .LBB0_2:
				; CHECK-LLC: vldrw q8, [r0]
				; CHECK-LLC: vldrw q9, [r1]
				; CHECK-LLC: adds r0, #16
				; CHECK-LLC: adds r1, #16
				; CHECK-LLC: adds r3, #4
				; CHECK-LLC: vmul.i32 q8, q9, q8
				; CHECK-LLC: vstrw q8, [r2]
				; CHECK-LLC: adds r2, #16
				; CHECK-LLC: letp .LBB0_2
				; CHECK-LLC: b .LBB0_3

				define dso_local arm_aapcs_vfpcc void @mul_N(i32* noalias nocapture readonly %a, i32* noalias nocapture readonly %b, i32* noalias nocapture %c, i32 %N) {
				entry:
				%cmp8 = icmp eq i32 %N, 0
				br i1 %cmp8, label %for.cond.cleanup, label %vector.ph

				vector.ph:
				%n.rnd.up = add i32 %N, 3
				%n.vec = and i32 %n.rnd.up, -4
				%trip.count.minus.1 = add i32 %N, -1
				%broadcast.splatinsert10 = insertelement <4 x i32> undef, i32 %trip.count.minus.1, i32 0
				%broadcast.splat11 = shufflevector <4 x i32> %broadcast.splatinsert10, <4 x i32> undef, <4 x i32> zeroinitializer
				br label %vector.body

				vector.body:
				%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
				%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
				%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
				%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
				%0 = getelementptr inbounds i32, i32* %a, i32 %index
				%1 = icmp ule <4 x i32> %induction, %broadcast.splat11
				%2 = bitcast i32* %0 to <4 x i32>*
				%wide.masked.load = tail call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %2, i32 4, <4 x i1> %1, <4 x i32> undef)
				%3 = getelementptr inbounds i32, i32* %b, i32 %index
				%4 = bitcast i32* %3 to <4 x i32>*
				%wide.masked.load12 = tail call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %4, i32 4, <4 x i1> %1, <4 x i32> undef)
				%5 = mul nsw <4 x i32> %wide.masked.load12, %wide.masked.load
				%6 = getelementptr inbounds i32, i32* %c, i32 %index
				%7 = bitcast i32* %6 to <4 x i32>*
				tail call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %5, <4 x i32>* %7, i32 4, <4 x i1> %1)
				%index.next = add i32 %index, 4
				%8 = icmp eq i32 %index.next, %n.vec
				br i1 %8, label %for.cond.cleanup, label %vector.body

				for.cond.cleanup:
				ret void
				}

				declare <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>*, i32 immarg, <4 x i1>, <4 x i32>)

				declare void @llvm.masked.store.v4i32.p0v4i32(<4 x i32>, <4 x i32>*, i32 immarg, <4 x i1>)
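
The input loop in mve-tailpred.ll computes c[i] = a[i] * b[i] with a 4-wide masked body, and the expected output drives both the lane mask and the loop exit from a single count of remaining elements. A minimal scalar model of that idea is sketched below. It assumes the element counter saturates at zero once fewer than four elements remain, which the test itself does not state, and `mulTailPredicated` is a hypothetical name used only for this illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr uint32_t VF = 4; // lanes per vector iteration, as in <4 x i32>

// Multiplies A and B element-wise into C using a tail-predicated loop shape.
// Returns the number of vector iterations executed.
uint32_t mulTailPredicated(const std::vector<int32_t> &A,
                           const std::vector<int32_t> &B,
                           std::vector<int32_t> &C) {
  assert(A.size() == B.size() && A.size() == C.size());
  uint32_t Remaining = static_cast<uint32_t>(A.size()); // element count N
  uint32_t Index = 0;
  uint32_t Iterations = 0;
  while (Remaining != 0) { // the while-style setup skips an empty loop
    // Active-lane mask: lane L is enabled only while L < Remaining, so the
    // final iteration never touches memory past the end of the arrays.
    for (uint32_t Lane = 0; Lane < VF; ++Lane)
      if (Lane < Remaining)
        C[Index + Lane] = A[Index + Lane] * B[Index + Lane];
    Index += VF;
    // Assumed semantics: the element counter saturates at zero.
    Remaining = Remaining > VF ? Remaining - VF : 0;
    ++Iterations;
  }
  return Iterations;
}
```

For N elements the model runs (N + 3) / 4 iterations, which agrees with the `%n.rnd.up` / `%n.vec` rounding in the input IR, and no scalar remainder loop is needed.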

Revision Contents

Diff 200890

include/llvm/Analysis/TargetTransformInfo.h

include/llvm/Analysis/TargetTransformInfoImpl.h

include/llvm/CodeGen/BasicTTIImpl.h

include/llvm/CodeGen/Passes.h

include/llvm/IR/Intrinsics.td

include/llvm/InitializePasses.h

lib/Analysis/TargetTransformInfo.cpp

lib/CodeGen/CMakeLists.txt

lib/CodeGen/HardwareLoops.cpp

lib/Target/ARM/ARM.h

lib/Target/ARM/ARMFinalizeHardwareLoops.cpp

lib/Target/ARM/ARMISelDAGToDAG.cpp

lib/Target/ARM/ARMISelLowering.h

lib/Target/ARM/ARMISelLowering.cpp

lib/Target/ARM/ARMInstrInfo.td

lib/Target/ARM/ARMInstrThumb2.td

lib/Target/ARM/ARMRegisterInfo.td

lib/Target/ARM/ARMTargetMachine.cpp

lib/Target/ARM/ARMTargetTransformInfo.h

lib/Target/ARM/ARMTargetTransformInfo.cpp

lib/Target/ARM/CMakeLists.txt

lib/Target/PowerPC/PPCCTRLoops.cpp

lib/Target/PowerPC/PPCISelLowering.cpp

lib/Target/PowerPC/PPCInstr64Bit.td

lib/Target/PowerPC/PPCInstrInfo.td

lib/Target/PowerPC/PPCTargetMachine.cpp

lib/Target/PowerPC/PPCTargetTransformInfo.h

lib/Target/PowerPC/PPCTargetTransformInfo.cpp

test/CodeGen/PowerPC/ctrloop-intrin.ll

test/CodeGen/PowerPC/ppc-passname.ll

test/CodeGen/Thumb2/mve-tailpred.ll