This is an archive of the discontinued LLVM Phabricator instance.

[LoopUnroller] Replace UnrollingPreferences::Force with ForceMaxCount + SystemZ getUnrollingPreferences().
ClosedPublic

Authored by jonpa on Sep 12 2016, 4:59 AM.


Details

Reviewers

chandlerc
uweigand
evstupac
hfinkel

Summary

This is an implementation of TargetTransformInfo::getUnrollingPreferences() for SystemZ.

In addition, there is also a change to UnrollingPreferences: the Force member is replaced by ForceMaxCount. The motivation is that it is important to get rid of tiny loops on SystemZ, and that this is beneficial even with forced unrolling (cloned iterations). However, forced unrolling should only be done with two or three iterations, so that the resulting loop is at least 6-8 instructions (the branch predictor can only handle a taken branch every other cycle). ForceMaxCount is needed to impose this limit.

Somebody, please take a look at the common code change that was introduced here.

Diff Detail

Event Timeline

jonpa updated this revision to Diff 70991. Sep 12 2016, 4:59 AM

jonpa retitled this revision to [LoopUnroller] Replace UnrollingPreferences::Force with ForceMaxCount + SystemZ getUnrollingPreferences().

jonpa updated this object.

jonpa added reviewers: uweigand, chandlerc, hfinkel, evstupac.

jonpa added a subscriber: llvm-commits.

Herald added subscribers: mzolotukhin, sanjoy. Sep 12 2016, 4:59 AM

Hi,

Overall, I don't like that Force unroll has its own MAX counter, but Partial and Runtime do not.
What is your case where you need to bound Force unroll counter, but Runtime is ok?
If you want to introduce a bound for Forced unroll, Threshold bound looks more general.

Do you have a test case to add?

Thanks,
Evgeny

lib/Target/SystemZ/SystemZTargetTransformInfo.cpp
262	"else" with brackets is better here
lib/Transforms/Utils/LoopUnroll.cpp
308	Is it possible to move the "Count" change to the /Scalar/ part? As written, this will also rewrite an unroll count set by the user (pragma unroll or the -unroll-count option).
310	Same. "else" with brackets is better here.

Hi Evgeny,

thanks for review!

Overall, I don't like that Force unroll has its own MAX counter, but Partial and Runtime do not.

There is MaxCount for partial/runtime and also FullUnrollMaxCount. So ForceMaxCount seemed logical to me, since forced unrolling is certainly a bit separate from the rest.

What is your case where you need to bound Force unroll counter, but Runtime is ok?

On z13, max 6 instructions can be decoded every cycle, but the SystemZ branch predictor can only handle a taken branch every other cycle. This means that tiny loops (<6 instructions or so) are really bad, and the main purpose of this patch is to get rid of them.

Before enabling forced unrolling, there was still a very significant number of such small loops left in benchmarks (SPEC). On SystemZ, a conditional branch is statically predicted as not taken, which is why forced unrolling should work (by inserting conditional exit-branches after each cloned iteration). Experiments showed that performance was best when force-unrolling just two or three times, and not more. So the idea here is to force-unroll just enough to make the loop bigger than 6 instructions at least, but not more.
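To make the shape of a force-unrolled loop concrete, here is a hand-written, source-level sketch. It is not output of the LLVM unroller; the function names and the string-length loop are made up for illustration. The body is cloned three times and every clone keeps its own exit test, so in the common case only the final backward branch is taken.

```cpp
#include <cstddef>

// Original tiny loop: one compare and one taken backward branch per element.
constexpr std::size_t lengthSimple(const char *S) {
  std::size_t N = 0;
  while (S[N] != '\0')
    ++N;
  return N;
}

// Hand-unrolled by 3 in the style of forced unrolling (illustrative only).
// Each cloned iteration keeps its own conditional exit, which is expected to
// fall through; only the end of the third clone branches back to the top.
constexpr std::size_t lengthUnrolled3(const char *S) {
  std::size_t N = 0;
  for (;;) {
    if (S[N] == '\0')
      return N; // exit test of clone 1
    if (S[N + 1] == '\0')
      return N + 1; // exit test of clone 2
    if (S[N + 2] == '\0')
      return N + 2; // exit test of clone 3
    N += 3;
  }
}
```

Both functions return the same value for any NUL-terminated string; the unrolled form simply takes one backward branch per three elements instead of one per element.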

There is also the issue of store tags depletion on z13. Simply put, the resulting loop should not have too many stores, because that may cause very expensive flushes. This is implemented in the patch by counting all the stores in the loop, and never unrolling past 12 stores in the final loop. This limit must be used for all unrolling, including the forced unrolling (and runtime).
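The store cap can be sketched as a small helper. The function name and the MaxStores parameter are hypothetical; the arithmetic follows the `12 / NumStores` expression in the SystemZ part of the diff.

```cpp
#include <climits>

// Hypothetical helper: the largest unroll factor that keeps the unrolled loop
// at or below MaxStores stores. With no stores there is no store-based limit.
constexpr unsigned maxCountForStores(unsigned NumStores,
                                     unsigned MaxStores = 12) {
  return NumStores ? MaxStores / NumStores : UINT_MAX;
}
```

For example, a loop with 5 stores gets a cap of 2, and a loop that already has more than 12 stores gets a cap of 0, which in the patch leads to the early return on `UP.MaxCount <= 1`.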

If you want to introduce a bound for Forced unroll, Threshold bound looks more general.

Do you mean I should use Threshold as a way to limit the forced unrolling? Could this be done with exact control over resulting iterations?

Do you have a test case to add?

Sorry, but I can only say that this patch, after a lot of experimentation, finally passed extensive performance tests, and that it is really needed as is, without any functional change.

/Jonas

While I read your explanation of the particular architectural challenges faced, I don't really understand why partial / runtime unrolling plus a remainder loop isn't a strictly superior option compared to forced unrolling with a cap?

I agree that forced unrolling should always be tried last, since partial / runtime should always be preferred. I believe this is also what happens if you enable -force, right?

I am basically assuming that the LoopUnroller does everything it can to do partial / runtime unrolling, and then as a final possibility also offers forced unrolling. In other words, I first tried enabling partial / runtime unrolling, only to find many loops not unrolled this way. Therefore I thought it was necessary to use forced unrolling as the only means left, since UnrollRuntimeLoopRemainder() sometimes can't, for example, compute the trip count, while this is no problem with forced unrolling.

Are you saying it somehow would be possible to get all loops runtime unrolled instead?

So the idea here is to force-unroll just enough to make the loop bigger than 6 instructions at least, but not more.

Then Threshold is better for your case. You should bound unrolling by 6. And first try to bound Runtime unroll (and therefore Force unroll as well). Only if this does not help create new Threshold for forced unroll.

With your current changes, a loop that has 150 instructions (UP.Threshold) after unrolling is OK, so a loop of 50 instructions could be unrolled by 2.

Do you mean I should use Threshold as a way to limit the forced unrolling? Could this be done with exact control over resulting iterations?

Yes, for IR. The loop unroller calls the loop size estimator, and Threshold is the bound on the number of instructions after unrolling.
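As a rough illustration of how a Threshold bounds the unroll factor, the sketch below reuses the size expression that appears in computeUnrollCount() in this diff, `(LoopSize - BEInsns) * Count + BEInsns < Threshold`. The helper names are hypothetical, and the closed form is just that inequality solved for Count.

```cpp
// Estimated size after unrolling by Count. The BEInsns backedge instructions
// are counted once rather than once per cloned body.
constexpr unsigned unrolledSize(unsigned LoopSize, unsigned Count,
                                unsigned BEInsns = 2) {
  return (LoopSize - BEInsns) * Count + BEInsns;
}

// Hypothetical helper: the largest Count whose estimated size is still
// strictly below Threshold, or 0 if no count qualifies.
constexpr unsigned largestCountBelow(unsigned LoopSize, unsigned Threshold,
                                     unsigned BEInsns = 2) {
  if (LoopSize <= BEInsns || Threshold <= BEInsns)
    return 0; // degenerate inputs; guard added for this sketch
  return (Threshold - 1 - BEInsns) / (LoopSize - BEInsns);
}
```

With a threshold of 150, a loop of size 50 has an estimated size of 98 when unrolled by 2, which is within the bound. Note that the bound is on estimated size, so it controls the iteration count only indirectly.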

Chandler,

On "SPEC", considering only final loops that ended up as 8 machine instructions or less,
with *my patch without forced unrolling*, I get that
   48  loops were unrolled, but ended up <= 8 MIs.
10353  loops were not unrolled, because:
  865  pragma disabled (from loop vectorizer -- scalar loop)
  342  UnrollLoop(): any of the first five "return false" (!Preheader or !LatchBlock or !L->isSafeToClone() or (!BI || BI->isUnconditional()))
 3174  loops containing calls - disabled purposefully for SystemZ
 5969  loops where UnrollLoop() returns false because UnrollRuntimeLoopRemainder() returns false, whereof:
       2300  any of the first three "return false" (!L->getExitingBlock() or !L->isLoopSimplifyForm() or !Exit)
       3669  4th "return false" (only unroll loops with a computable trip count, and the trip count needs to be an int value; allowing a pointer type is a TODO item)

Same, but the complete patch including forced unrolling:

 606  loops were unrolled, but ended up <= 8 MIs.  (I happen to know that they are practically all >6, though ;-)
4434  loops were not unrolled, because:
 865  pragma disabled (from loop vectorizer -- scalar loop)
 342  UnrollLoop(): any of the first five "return false" (!Preheader or !LatchBlock or !L->isSafeToClone() or (!BI || BI->isUnconditional()))
3174  loops containing calls - disabled purposefully for SystemZ
  54  loops where UnrollLoop() returns false because UnrollRuntimeLoopRemainder() returns false, whereof:
       24  any of the first three "return false" (!L->getExitingBlock() or !L->isLoopSimplifyForm() or !Exit)
       30  4th "return false" (only unroll loops with a computable trip count, and the trip count needs to be an int value; allowing a pointer type is a TODO item)

It is quite clear that the runtime unroller cannot currently handle all loops. The reasons seem to relate to the LoopInfo and SCEV classes.
It is also clear that forced unrolling basically handles nearly everything left.

Evgeny,

I see your point that there is already a loop size estimate available. This however is not an estimate only of instruction count, but also weighs in
execution factors.

My background here is that first of all I am interested in the resulting machine instruction count only, so e.g. a div should only count as one. Therefore I am doing

if (getUserCost(I) != TargetTransformInfo::TargetCostConstants::TCC_Free)
MICountEstimate++;

This patch as it is now gets rid of basically all loops (except those with calls and pragma unroll disable) smaller than 6 instructions, by aiming for an MICountEstimate
of 12. I found that an estimate of 12 will include 99% of loops <= 8 MIs, and that the resulting ForceMaxCount of ceil(12 / MICountEstimate), limited by 3, got
rid of practically 99.9% of loops <= 6 MIs. This is exactly what was the goal with the patch, and this is a very good result that would be a shame to change.
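The count computation described here, ceil(12 / MICountEstimate) limited by 3 and by the store cap, can be sketched as a standalone function. The function and parameter names are hypothetical; the zero guard is an addition of this sketch and is not in the diff as shown.

```cpp
#include <algorithm>
#include <climits>

// Hypothetical helper mirroring the ForceMaxCount computation in the patch.
// MICountEstimate: estimated machine instructions in the loop body.
// StoreCap: maximum unroll factor allowed by the store limit.
constexpr unsigned forceMaxCount(unsigned MICountEstimate, unsigned StoreCap,
                                 unsigned MinSize = 12,
                                 unsigned MaxIterations = 3) {
  if (MICountEstimate == 0)
    return 0; // guard added for this sketch to avoid dividing by zero
  // Smallest count that brings the loop up to at least MinSize instructions.
  unsigned Count = (MinSize + MICountEstimate - 1) / MICountEstimate;
  // Never more than MaxIterations cloned iterations with exit branches.
  Count = std::min(Count, MaxIterations);
  // Never past the limit on stores per loop.
  return std::min(Count, StoreCap);
}
```

Under these assumptions a 4-instruction loop is force-unrolled by 3, a 6-instruction loop by 2, and a loop that already has 12 or more instructions gets a count of 1, that is, it is left alone.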

Having worked with this patch for a while, it has become clear that the *exact number of iterations* produced needs to be controlled. The resulting loop should have no more than 12 stores, and it seems bad to have more than 3 exit branches w/ forced unrolling.

This patch did not work very well before, so it was the forced unrolling in exactly this setting that made it all work (both performance wise, and also achieving the objective of getting rid of all small loops). Therefore I would be very reluctant to change any functionality.

So if you really object to this, I might still need to do my own MICountEstimate, compute the ForceMaxCount, and then somehow access the LoopSize estimate of the LoopUnrollPass, and in this indirect way give a new variable ForceThreshold a value that gives the same results.

Having worked with this patch for a while, it has become clear that the *exact number of iterations* produced needs to be controlled. The resulting loop should have no more than 12 stores, and it seems bad to have more than 3 exit branches w/ forced unrolling.

Is this true for other unroll candidates (Partial, Runtime)? I'm trying to understand what is so specific about Force unroll.
If You need to control number of stores/branches you can limit MaxCount for all kinds of unroll. Or even better - calculate specific count for your architecture based on number of stores/branches.

Having worked with this patch for a while, it has become clear that the *exact number of iterations* produced needs to be controlled. The resulting loop should have no more than 12 stores, and it seems bad to have more than 3 exit branches w/ forced unrolling.

Is this true for other unroll candidates (Partial, Runtime)? I'm trying to understand what is so specific about Force unroll.

All loops are bounded by number of stores. The difference with forced unrolling is that I want to use a general bound (max 3 iterations.). So even if there are no stores, there is still a limit. The produced loops are different with -force, and benchmarks have shown this is better (to not get e.g. 8 iterations with exit branches, but just 2-3).

If You need to control number of stores/branches you can limit MaxCount for all kinds of unroll. Or even better - calculate specific count for your architecture based on number of stores/branches.

This is what the patch is doing, except it doesn't limit on branches for partial / runtime.

Added brackets around else statements per suggestions.

Evgeny, fixed the braces, but not sure about your other comment.

I would like to mention that small loops are also bad on SystemZ because they make compiler development generally difficult, as they are sensitive to code changes. In other words, when working on any arbitrary optimization, a small code change could potentially cause a regression. This is the reason I have worked on loop unrolling, and why my scheduler patch for SystemZ is yet to be committed (D17260).

So I would like to commit this patch with the reasoning that

Elimination of small loops is important on SystemZ, and only forced unrolling can (at least currently) handle them all (for numbers, see earlier reply to Chandler).
Forced unrolling produces different results but is nevertheless needed for SystemZ. It should however only be used sparingly, as a last resort. It has been shown with benchmarks that it is best to do max 2-3 iterations, to get the loop above the "tiny" threshold. This means it does in fact deserve its own max count variable.
The patch has been proven beneficial on benchmarks.

Will first await Evgeny's reply, of course.

lib/Transforms/Utils/LoopUnroll.cpp
308	I am not sure what you mean. The patch does not change any previous behaviour for other targets than SystemZ. Please illustrate.

If You need to control number of stores/branches you can limit MaxCount for all kinds of unroll. Or even better - calculate specific count for your architecture based on number of stores/branches.

This is what the patch is doing, except it doesn't limit on branches for partial / runtime.

There are several cases when runtime unroll fails. More than one exit branch is only one reason. That means you are limiting force unroll count even if there are less than 3 exit branches.

What I suggested is to count exit branches (without a dependency on unroll type). Say there are N in the loop. Then if (N > 1), set MaxCount to 3/(N - 1) without introducing ForceMaxCount.
That way if you have 2 or more exit branches - Runtime Unroll will fail, but Count will be not more than 3.
If you have 1 exit branch you will not affect MaxCount and proceed with RuntimeUnroll.
So basically the behavior would be the same but you'll not limit force unroll count by 3 when you have less than 3 exit branches.
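The suggestion above can be written down as a small function. This is a sketch of the proposal as worded in the comment, not code from the patch; the function and parameter names are hypothetical.

```cpp
// Hypothetical helper for the proposal: derive MaxCount from the number of
// exit branches N instead of introducing ForceMaxCount.
// NumExits: exit branches in the loop. CurrentMaxCount: MaxCount before the
// adjustment (for example the store-based cap).
constexpr unsigned maxCountForExits(unsigned NumExits,
                                    unsigned CurrentMaxCount) {
  if (NumExits <= 1)
    return CurrentMaxCount; // single exit: leave MaxCount untouched
  return 3 / (NumExits - 1); // integer division, as in the comment
}
```

With integer division, two exit branches give a MaxCount of 3, three or four give 1, and five or more give 0. Whether 0 should then mean "do not unroll", and whether the result should also be clamped to the existing MaxCount, is not specified in the comment.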

lib/Transforms/Utils/LoopUnroll.cpp
310	Even for SystemZ this is not good. And someone else in future could also use ForceMaxCount for his own architecture. It would be good not to change Count in many places.

Per Uli's suggestions, I have now tried an alternate approach that makes a different change to the common code. Instead of adding ForceMaxCount, the DefaultUnrollRuntimeCount value is moved into UnrollingPreferences. This way, SystemZ can set it to 4 instead of 8, which has been shown to be as good as the earlier version on the benchmarks.
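A minimal model of this alternate approach, assuming a reduced stand-in for the preferences struct: only the DefaultUnrollRuntimeCount field name and the values 8 and 4 come from the comment above. The struct, the helper functions, and the clamping against MaxCount are illustrative assumptions, not the LLVM implementation.

```cpp
#include <algorithm>
#include <climits>

// Reduced stand-in for TargetTransformInfo::UnrollingPreferences; only the
// fields needed for this sketch are modeled.
struct UnrollingPreferencesSketch {
  bool Runtime = false;
  unsigned MaxCount = UINT_MAX;
  unsigned DefaultUnrollRuntimeCount = 8; // target-independent default
};

// What a SystemZ-like target hook might set (hypothetical function).
constexpr UnrollingPreferencesSketch systemZLikePreferences() {
  UnrollingPreferencesSketch UP{};
  UP.Runtime = true;
  UP.DefaultUnrollRuntimeCount = 4; // smaller default than the generic 8
  return UP;
}

// Assumed selection rule for this sketch: start from the per-target default
// and clamp it to MaxCount; 0 means runtime unrolling is not enabled.
constexpr unsigned runtimeUnrollCount(const UnrollingPreferencesSketch &UP) {
  return UP.Runtime ? std::min(UP.DefaultUnrollRuntimeCount, UP.MaxCount) : 0;
}
```

The point of the design is that the common code needs no notion of "forced" limits at all; a target that wants fewer cloned iterations just lowers its default.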

What I suggested is to count exit branches (without a dependency on unroll type)...

I agree it might be interesting to count the number of branches and experiment further with that. It would be nice however to first get the basic patch approved as it is now.

The solution in the general part is much better. I like it.
I've made only a few general comments on the SystemZ-specific part. It would be very good if someone working on SystemZ could review it.
Unroll part LGTM.

include/llvm/Analysis/TargetTransformInfo.h
268	"if -unroll-count is not set": we should mention #pragma as well, or nothing.
lib/Target/SystemZ/SystemZTargetTransformInfo.cpp
247	According to the latest commits, the following (or something similar in your case) looks better: for (auto &BB : L->blocks()) for (auto &I : *BB)
261	Should be on one line: } else {

Updated as requested.
NFC

The SystemZ part looks good to me.

Evgeny, thanks for your comments and review of the unroll part!

lib/Target/SystemZ/SystemZTargetTransformInfo.cpp
243	Comment now seems to be out of date, the code no longer computes a machine instruction count.

This revision is now accepted and ready to land. Sep 27 2016, 6:44 AM

Thanks Evgeny for review and suggestions!

Committed as r282570

Revision Contents

Path	Size
include/llvm/Analysis/TargetTransformInfo.h	7 lines
include/llvm/Transforms/Utils/UnrollLoop.h	11 lines
lib/Target/SystemZ/SystemZTargetTransformInfo.h	2 lines
lib/Target/SystemZ/SystemZTargetTransformInfo.cpp	78 lines
lib/Transforms/Scalar/LoopUnrollPass.cpp	8 lines
lib/Transforms/Utils/LoopUnroll.cpp	15 lines

Diff 70991

include/llvm/Analysis/TargetTransformInfo.h

struct UnrollingPreferences {
/// UINT_MAX to disable).		/// UINT_MAX to disable).
unsigned PartialOptSizeThreshold;		unsigned PartialOptSizeThreshold;
/// A forced unrolling factor (the number of concatenated bodies of the		/// A forced unrolling factor (the number of concatenated bodies of the
/// original loop in the unrolled loop body). When set to 0, the unrolling		/// original loop in the unrolled loop body). When set to 0, the unrolling
/// transformation will select an unrolling factor based on the current cost		/// transformation will select an unrolling factor based on the current cost
/// threshold and other factors.		/// threshold and other factors.
unsigned Count;		unsigned Count;
// Set the maximum unrolling factor. The unrolling factor may be selected		// Set the maximum unrolling factor. The unrolling factor may be selected
// using the appropriate cost threshold, but may not exceed this number		// using the appropriate cost threshold, but may not exceed this number
		evstupac: "if -unroll-count is not set": we should mention #pragma as well, or nothing.
// (set to UINT_MAX to disable). This does not apply in cases where the		// (set to UINT_MAX to disable). This does not apply in cases where the
// loop is being fully unrolled.		// loop is being fully unrolled.
unsigned MaxCount;		unsigned MaxCount;
/// Set the maximum unrolling factor for full unrolling. Like MaxCount, but		/// Set the maximum unrolling factor for full unrolling. Like MaxCount, but
/// applies even if full unrolling is selected. This allows a target to fall		/// applies even if full unrolling is selected. This allows a target to fall
/// back to Partial unrolling if full unrolling is above FullUnrollMaxCount.		/// back to Partial unrolling if full unrolling is above FullUnrollMaxCount.
unsigned FullUnrollMaxCount;		unsigned FullUnrollMaxCount;
/// Allow partial unrolling (unrolling of loops to expand the size of the		/// Allow partial unrolling (unrolling of loops to expand the size of the
/// loop body, not only to eliminate small constant-trip-count loops).		/// loop body, not only to eliminate small constant-trip-count loops).
bool Partial;		bool Partial;
/// Allow runtime unrolling (unrolling of loops to expand the size of the		/// Allow runtime unrolling (unrolling of loops to expand the size of the
/// loop body even when the number of loop iterations is not known at		/// loop body even when the number of loop iterations is not known at
/// compile time).		/// compile time).
bool Runtime;		bool Runtime;
/// Allow generation of a loop remainder (extra iterations after unroll).		/// Allow generation of a loop remainder (extra iterations after unroll).
bool AllowRemainder;		bool AllowRemainder;
/// Allow emitting expensive instructions (such as divisions) when computing		/// Allow emitting expensive instructions (such as divisions) when computing
/// the trip count of a loop for runtime unrolling.		/// the trip count of a loop for runtime unrolling.
bool AllowExpensiveTripCount;		bool AllowExpensiveTripCount;
/// Apply loop unroll on any kind of loop		/// Apply loop unroll on any kind of loop (mainly to loops that
/// (mainly to loops that fail runtime unrolling).		/// fail runtime unrolling). 0 disables while any other value is
bool Force;		/// the maximum allowed unrolling factor for forced unrolling.
		unsigned ForceMaxCount;
};		};

/// \brief Get target-customized preferences for the generic loop unrolling		/// \brief Get target-customized preferences for the generic loop unrolling
/// transformation. The caller will initialize UP with the current		/// transformation. The caller will initialize UP with the current
/// target-independent defaults.		/// target-independent defaults.
void getUnrollingPreferences(Loop *L, UnrollingPreferences &UP) const;		void getUnrollingPreferences(Loop *L, UnrollingPreferences &UP) const;

/// @}		/// @}

include/llvm/Transforms/Utils/UnrollLoop.h

	class Loop;			class Loop;
	class LoopInfo;			class LoopInfo;
	class LPPassManager;			class LPPassManager;
	class MDNode;			class MDNode;
	class Pass;			class Pass;
	class OptimizationRemarkEmitter;			class OptimizationRemarkEmitter;
	class ScalarEvolution;			class ScalarEvolution;

	bool UnrollLoop(Loop *L, unsigned Count, unsigned TripCount, bool Force,			bool UnrollLoop(Loop *L, unsigned Count, unsigned TripCount,
	bool AllowRuntime, bool AllowExpensiveTripCount,			unsigned ForceMaxCount, bool AllowRuntime,
	unsigned TripMultiple, LoopInfo *LI, ScalarEvolution *SE,			bool AllowExpensiveTripCount, unsigned TripMultiple,
	DominatorTree *DT, AssumptionCache *AC,			LoopInfo *LI, ScalarEvolution *SE, DominatorTree *DT,
	OptimizationRemarkEmitter *ORE, bool PreserveLCSSA);			AssumptionCache *AC, OptimizationRemarkEmitter *ORE,
				bool PreserveLCSSA);

	bool UnrollRuntimeLoopRemainder(Loop *L, unsigned Count,			bool UnrollRuntimeLoopRemainder(Loop *L, unsigned Count,
	bool AllowExpensiveTripCount,			bool AllowExpensiveTripCount,
	bool UseEpilogRemainder, LoopInfo *LI,			bool UseEpilogRemainder, LoopInfo *LI,
	ScalarEvolution *SE, DominatorTree *DT,			ScalarEvolution *SE, DominatorTree *DT,
	bool PreserveLCSSA);			bool PreserveLCSSA);

	MDNode *GetUnrollMetadata(MDNode *LoopID, StringRef Name);			MDNode *GetUnrollMetadata(MDNode *LoopID, StringRef Name);
	}			}

	#endif			#endif

lib/Target/SystemZ/SystemZTargetTransformInfo.h

public:
int getIntImmCost(const APInt &Imm, Type *Ty);		int getIntImmCost(const APInt &Imm, Type *Ty);

int getIntImmCost(unsigned Opcode, unsigned Idx, const APInt &Imm, Type *Ty);		int getIntImmCost(unsigned Opcode, unsigned Idx, const APInt &Imm, Type *Ty);
int getIntImmCost(Intrinsic::ID IID, unsigned Idx, const APInt &Imm,		int getIntImmCost(Intrinsic::ID IID, unsigned Idx, const APInt &Imm,
Type *Ty);		Type *Ty);

TTI::PopcntSupportKind getPopcntSupport(unsigned TyWidth);		TTI::PopcntSupportKind getPopcntSupport(unsigned TyWidth);

		void getUnrollingPreferences(Loop *L, TTI::UnrollingPreferences &UP);

/// @}		/// @}

/// \name Vector TTI Implementations		/// \name Vector TTI Implementations
/// @{		/// @{

unsigned getNumberOfRegisters(bool Vector);		unsigned getNumberOfRegisters(bool Vector);
unsigned getRegisterBitWidth(bool Vector);		unsigned getRegisterBitWidth(bool Vector);

/// @}		/// @}
};		};

} // end namespace llvm		} // end namespace llvm

#endif		#endif

lib/Target/SystemZ/SystemZTargetTransformInfo.cpp

	TargetTransformInfo::PopcntSupportKind			TargetTransformInfo::PopcntSupportKind
	SystemZTTIImpl::getPopcntSupport(unsigned TyWidth) {			SystemZTTIImpl::getPopcntSupport(unsigned TyWidth) {
	assert(isPowerOf2_32(TyWidth) && "Type width must be power of 2");			assert(isPowerOf2_32(TyWidth) && "Type width must be power of 2");
	if (ST->hasPopulationCount() && TyWidth <= 64)			if (ST->hasPopulationCount() && TyWidth <= 64)
	return TTI::PSK_FastHardware;			return TTI::PSK_FastHardware;
	return TTI::PSK_Software;			return TTI::PSK_Software;
	}			}

				void SystemZTTIImpl::getUnrollingPreferences(Loop *L,
				TTI::UnrollingPreferences &UP) {
				// Find out if L contains a call, what the machine instruction count
				uweigand: Comment now seems to be out of date, the code no longer computes a machine instruction count.
				// estimate is, and how many stores there are.
				bool HasCall = false;
				unsigned MICountEstimate = 0;
				unsigned NumStores = 0;
				evstupac: According to the latest commits, the following (or something similar in your case) looks better: for (auto &BB : L->blocks()) for (auto &I : *BB)
				for (Loop::block_iterator I = L->block_begin(), E = L->block_end();
				I != E; ++I) {
				BasicBlock *BB = *I;
				for (BasicBlock::iterator J = BB->begin(), JE = BB->end(); J != JE; ++J) {
				Instruction *I = &*J;
				if (isa<CallInst>(I) || isa<InvokeInst>(I)) {
				ImmutableCallSite CS(I);
				if (const Function *F = CS.getCalledFunction()) {
				if (isLoweredToCall(F))
				HasCall = true;
				if (F->getIntrinsicID() == Intrinsic::memcpy ||
				F->getIntrinsicID() == Intrinsic::memset)
				NumStores++;
				}
				evstupac: Should be on one line: } else {
				else // indirect call.
				evstupac: "else" with brackets is better here
				HasCall = true;
				}
				if (isa<StoreInst>(J)) {
				NumStores++;
				Type *MemAccessTy = J->getOperand(0)->getType();
				if((MemAccessTy->isIntegerTy() || MemAccessTy->isFloatingPointTy()) &&
				(getDataLayout().getTypeSizeInBits(MemAccessTy) == 128))
				NumStores++; // 128 bit fp/int stores get split.
				}
				if (getUserCost(I) !=
				TargetTransformInfo::TargetCostConstants::TCC_Free)
				MICountEstimate++;
				}
				}

				// The z13 processor will run out of store tags if too many stores
				// are fed into it too quickly. Therefore make sure there are not
				// too many stores in the resulting unrolled loop.
				unsigned const Max = (NumStores ? (12 / NumStores) : UINT_MAX);

				if (HasCall) {
				// Only allow full unrolling if loop has any calls.
				UP.FullUnrollMaxCount = Max;
				UP.MaxCount = 1;
				return;
				}

				UP.MaxCount = Max;
				if (UP.MaxCount <= 1)
				return;

				// Allow partial and runtime trip count unrolling.
				UP.Partial = UP.Runtime = true;
				UP.PartialThreshold = UP.Threshold;

				// Allow expensive instructions in the pre-header of the loop.
				UP.AllowExpensiveTripCount = true;

				// If unrolling is forced (i.e. producing cloned iterations each
				// with an exit branch), only do enough to get rid of tiny
				// loops.

				// Compute a smallest count to make the loop at least 8
				// instructions. The estimate for number of MachineIntrs in loop is
				// not always exact, but using a value of 12 gets ~99% of loops <= 8
				// MIs.
				unsigned const MinSize = 12;
				UP.ForceMaxCount =
				((MinSize % MICountEstimate) ? ((MinSize / MICountEstimate) + 1)
				: (MinSize / MICountEstimate));
				// Don't do more than max 3 iterations
				UP.ForceMaxCount = std::min(UP.ForceMaxCount, 3U);
				// Don't go past the limit for number of stores per loop.
				UP.ForceMaxCount = std::min(UP.ForceMaxCount, Max);
				}

	unsigned SystemZTTIImpl::getNumberOfRegisters(bool Vector) {			unsigned SystemZTTIImpl::getNumberOfRegisters(bool Vector) {
	if (!Vector)			if (!Vector)
	// Discount the stack pointer. Also leave out %r0, since it can't			// Discount the stack pointer. Also leave out %r0, since it can't
	// be used in an address.			// be used in an address.
	return 14;			return 14;
	if (ST->hasVector())			if (ST->hasVector())
	return 32;			return 32;
	return 0;			return 0;

lib/Transforms/Scalar/LoopUnrollPass.cpp

static TargetTransformInfo::UnrollingPreferences gatherUnrollingPreferences(
UP.PartialOptSizeThreshold = 0;		UP.PartialOptSizeThreshold = 0;
UP.Count = 0;		UP.Count = 0;
UP.MaxCount = UINT_MAX;		UP.MaxCount = UINT_MAX;
UP.FullUnrollMaxCount = UINT_MAX;		UP.FullUnrollMaxCount = UINT_MAX;
UP.Partial = false;		UP.Partial = false;
UP.Runtime = false;		UP.Runtime = false;
UP.AllowRemainder = true;		UP.AllowRemainder = true;
UP.AllowExpensiveTripCount = false;		UP.AllowExpensiveTripCount = false;
UP.Force = false;		UP.ForceMaxCount = 0;

// Override with any target specific settings		// Override with any target specific settings
TTI.getUnrollingPreferences(L, UP);		TTI.getUnrollingPreferences(L, UP);

// Apply size attributes		// Apply size attributes
if (L->getHeader()->getParent()->optForSize()) {		if (L->getHeader()->getParent()->optForSize()) {
UP.Threshold = UP.OptSizeThreshold;		UP.Threshold = UP.OptSizeThreshold;
UP.PartialThreshold = UP.PartialOptSizeThreshold;		UP.PartialThreshold = UP.PartialOptSizeThreshold;
static bool computeUnrollCount(Loop *L, const TargetTransformInfo &TTI,
// feeding it.		// feeding it.
unsigned BEInsns = 2;		unsigned BEInsns = 2;
// Check for explicit Count.		// Check for explicit Count.
// 1st priority is unroll count set by "unroll-count" option.		// 1st priority is unroll count set by "unroll-count" option.
bool UserUnrollCount = UnrollCount.getNumOccurrences() > 0;		bool UserUnrollCount = UnrollCount.getNumOccurrences() > 0;
if (UserUnrollCount) {		if (UserUnrollCount) {
UP.Count = UnrollCount;		UP.Count = UnrollCount;
UP.AllowExpensiveTripCount = true;		UP.AllowExpensiveTripCount = true;
UP.Force = true;		UP.ForceMaxCount = UINT_MAX;
if (UP.AllowRemainder &&		if (UP.AllowRemainder &&
(LoopSize - BEInsns) * UP.Count + BEInsns < UP.Threshold)		(LoopSize - BEInsns) * UP.Count + BEInsns < UP.Threshold)
return true;		return true;
}		}

// 2nd priority is unroll count set by pragma.		// 2nd priority is unroll count set by pragma.
unsigned PragmaCount = UnrollCountPragmaValue(L);		unsigned PragmaCount = UnrollCountPragmaValue(L);
if (PragmaCount > 0) {		if (PragmaCount > 0) {
UP.Count = PragmaCount;		UP.Count = PragmaCount;
UP.Runtime = true;		UP.Runtime = true;
UP.AllowExpensiveTripCount = true;		UP.AllowExpensiveTripCount = true;
UP.Force = true;		UP.ForceMaxCount = UINT_MAX;
if (UP.AllowRemainder &&		if (UP.AllowRemainder &&
(LoopSize - BEInsns) * UP.Count + BEInsns < PragmaUnrollThreshold)		(LoopSize - BEInsns) * UP.Count + BEInsns < PragmaUnrollThreshold)
return true;		return true;
}		}
bool PragmaFullUnroll = HasUnrollFullPragma(L);		bool PragmaFullUnroll = HasUnrollFullPragma(L);
if (PragmaFullUnroll && TripCount != 0) {		if (PragmaFullUnroll && TripCount != 0) {
UP.Count = TripCount;		UP.Count = TripCount;
if ((LoopSize - BEInsns) * UP.Count + BEInsns < PragmaUnrollThreshold)		if ((LoopSize - BEInsns) * UP.Count + BEInsns < PragmaUnrollThreshold)
bool IsCountSetExplicitly = computeUnrollCount(
L, TTI, DT, LI, SE, &ORE, TripCount, TripMultiple, LoopSize, UP);		L, TTI, DT, LI, SE, &ORE, TripCount, TripMultiple, LoopSize, UP);
if (!UP.Count)		if (!UP.Count)
return false;		return false;
// Unroll factor (Count) must be less or equal to TripCount.		// Unroll factor (Count) must be less or equal to TripCount.
if (TripCount && UP.Count > TripCount)		if (TripCount && UP.Count > TripCount)
UP.Count = TripCount;		UP.Count = TripCount;

// Unroll the loop.		// Unroll the loop.
if (!UnrollLoop(L, UP.Count, TripCount, UP.Force, UP.Runtime,		if (!UnrollLoop(L, UP.Count, TripCount, UP.ForceMaxCount, UP.Runtime,
UP.AllowExpensiveTripCount, TripMultiple, LI, SE, &DT, &AC,		UP.AllowExpensiveTripCount, TripMultiple, LI, SE, &DT, &AC,
&ORE, PreserveLCSSA))		&ORE, PreserveLCSSA))
return false;		return false;

// If loop has an unroll count pragma or unrolled by explicitly set count		// If loop has an unroll count pragma or unrolled by explicitly set count
// mark loop as unrolled to prevent unrolling beyond that requested.		// mark loop as unrolled to prevent unrolling beyond that requested.
if (IsCountSetExplicitly)		if (IsCountSetExplicitly)
SetLoopAlreadyUnrolled(L);		SetLoopAlreadyUnrolled(L);

lib/Transforms/Utils/LoopUnroll.cpp

/// iterations before branching into the unrolled loop. UnrollLoop will not		/// iterations before branching into the unrolled loop. UnrollLoop will not
/// runtime-unroll the loop if computing RuntimeTripCount will be expensive and		/// runtime-unroll the loop if computing RuntimeTripCount will be expensive and
/// AllowExpensiveTripCount is false.		/// AllowExpensiveTripCount is false.
///		///
/// The LoopInfo Analysis that is passed will be kept consistent.		/// The LoopInfo Analysis that is passed will be kept consistent.
///		///
/// This utility preserves LoopInfo. It will also preserve ScalarEvolution and		/// This utility preserves LoopInfo. It will also preserve ScalarEvolution and
/// DominatorTree if they are non-null.		/// DominatorTree if they are non-null.
bool llvm::UnrollLoop(Loop *L, unsigned Count, unsigned TripCount, bool Force,		bool llvm::UnrollLoop(Loop *L, unsigned Count, unsigned TripCount,
bool AllowRuntime, bool AllowExpensiveTripCount,		unsigned ForceMaxCount, bool AllowRuntime,
unsigned TripMultiple, LoopInfo *LI, ScalarEvolution *SE,		bool AllowExpensiveTripCount, unsigned TripMultiple,
DominatorTree *DT, AssumptionCache *AC,		LoopInfo *LI, ScalarEvolution *SE, DominatorTree *DT,
OptimizationRemarkEmitter *ORE, bool PreserveLCSSA) {		AssumptionCache *AC, OptimizationRemarkEmitter *ORE,
		bool PreserveLCSSA) {
BasicBlock *Preheader = L->getLoopPreheader();		BasicBlock *Preheader = L->getLoopPreheader();
if (!Preheader) {		if (!Preheader) {
DEBUG(dbgs() << " Can't unroll; loop preheader-insertion failed.\n");		DEBUG(dbgs() << " Can't unroll; loop preheader-insertion failed.\n");
return false;		return false;
}		}

BasicBlock *LatchBlock = L->getLoopLatch();		BasicBlock *LatchBlock = L->getLoopLatch();
if (!LatchBlock) {		if (!LatchBlock) {
[80 lines not shown]
DEBUG(		DEBUG(
});		});
// Don't output the runtime loop remainder if Count is a multiple of		// Don't output the runtime loop remainder if Count is a multiple of
// TripMultiple. Such a remainder is never needed, and is unsafe if the loop		// TripMultiple. Such a remainder is never needed, and is unsafe if the loop
// contains a convergent instruction.		// contains a convergent instruction.
if (RuntimeTripCount && TripMultiple % Count != 0 &&		if (RuntimeTripCount && TripMultiple % Count != 0 &&
!UnrollRuntimeLoopRemainder(L, Count, AllowExpensiveTripCount,		!UnrollRuntimeLoopRemainder(L, Count, AllowExpensiveTripCount,
UnrollRuntimeEpilog, LI, SE, DT,		UnrollRuntimeEpilog, LI, SE, DT,
PreserveLCSSA)) {		PreserveLCSSA)) {
if (Force)		if (ForceMaxCount > 1) {
RuntimeTripCount = false;		RuntimeTripCount = false;
		Count = std::min(Count, ForceMaxCount);
		evstupac: Is this possible to move "Count" change to /Scalar/ part? This will also rewrite unroll count set by user (pragma unroll or -unroll-count option).
		jonpa (author): I am not sure what you mean. The patch does not change any previous behaviour for other targets than SystemZ. Please illustrate.
		}
else		else
		evstupac: Same. "else" with brackets is better here.
		evstupac: Even for SystemZ this is not good. And someone else in future could also use ForceMaxCount for his own architecture. It would be good not to change Count in many places.
return false;		return false;
}		}

// Notify ScalarEvolution that the loop will be substantially changed,		// Notify ScalarEvolution that the loop will be substantially changed,
// if not outright eliminated.		// if not outright eliminated.
if (SE)		if (SE)
SE->forgetLoop(L);		SE->forgetLoop(L);
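
The fallback branch under discussion can be hard to follow in side-by-side form, so here is a minimal standalone sketch of it. It models only the decision taken when UnrollRuntimeLoopRemainder fails: with ForceMaxCount > 1 the patch gives up on runtime unrolling and clamps Count, otherwise unrolling is abandoned. The struct UnrollDecision and the function onRemainderFailure are hypothetical names made up for illustration; they are not part of LLVM or of this patch.

```cpp
#include <algorithm>

// Hypothetical result type for illustration only; not an LLVM type.
struct UnrollDecision {
  bool Proceed;       // false models "return false" from UnrollLoop
  bool RuntimeUnroll; // models RuntimeTripCount after the fallback
  unsigned Count;     // unroll factor that would be used
};

// Models the patched branch that runs when the runtime remainder loop
// could not be generated. Assumption: ForceMaxCount <= 1 means
// "forced unrolling disabled", mirroring the `ForceMaxCount > 1` test.
UnrollDecision onRemainderFailure(unsigned Count, unsigned ForceMaxCount) {
  if (ForceMaxCount > 1) {
    // Fall back to forced unrolling, bounded by the target's limit.
    return {true, false, std::min(Count, ForceMaxCount)};
  }
  // No forced unrolling allowed: do not unroll at all.
  return {false, false, Count};
}
```

With Count = 8 and ForceMaxCount = 3 the sketch yields a forced unroll by 3, which is the late rewrite of Count that the reviewer objects to: a count requested via pragma or -unroll-count would be overridden inside the utility rather than in the Scalar pass.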



Revision Contents

Diff 70991

include/llvm/Analysis/TargetTransformInfo.h

include/llvm/Transforms/Utils/UnrollLoop.h

lib/Target/SystemZ/SystemZTargetTransformInfo.h

lib/Target/SystemZ/SystemZTargetTransformInfo.cpp

lib/Transforms/Scalar/LoopUnrollPass.cpp

lib/Transforms/Utils/LoopUnroll.cpp
