
[LoopDist] Add llvm.loop.distribute.enable loop metadata
ClosedPublic

Authored by anemet on Apr 22 2016, 12:24 PM.

Details

Summary

D19403 adds a new pragma for loop distribution. This change adds
support for the corresponding metadata that the pragma is translated to
by the frontend (FE).
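For illustration, a hypothetical source-level use of the pragma; the FE lowers distribute(enable) to the loop metadata !{!"llvm.loop.distribute.enable", i1 1} attached to the loop. The function and array names below are made up; the loop shape is the classic distribution example where one statement carries an unsafe dependence and the other is independently vectorizable:

```c
/* Hypothetical example: S1 carries a loop-carried dependence on A
 * (distance 1), which blocks vectorizing the loop as a whole; S2 is
 * independent.  distribute(enable) asks LoopDistribute to split the
 * loop so that at least the S2 loop can be vectorized. */
void kernel(int *A, const int *B, int *C, const int *D, const int *E, int n) {
#pragma clang loop distribute(enable)
  for (int i = 0; i < n; ++i) {
    A[i + 1] = A[i] * B[i]; /* S1: stays in a scalar loop after the split */
    C[i] = D[i] * E[i];     /* S2: vectorizable once separated from S1    */
  }
}
```

Without the pragma, the unsafe dependence on A makes the whole loop non-vectorizable; after distribution the second loop is a straightforward vectorization candidate.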

As part of this I had to rethink the flag -enable-loop-distribute. My
goal was to be backward compatible with the existing behavior:

A1. pass is off by default from the optimization pipeline
unless -enable-loop-distribute is specified

A2. pass is on when invoked directly from opt (e.g. for unit-testing)

The new pragma/metadata overrides these defaults so the new behavior is:

B1. A1 + enable distribution for individual loop with the pragma/metadata

B2. A2 + disable distribution for individual loop with the pragma/metadata

The default value of whether the pass is on or off comes from the
initiator of the pass: from the PassManagerBuilder the default is off;
from opt it's on.

I moved -enable-loop-distribute under the pass. If the flag is
specified, it overrides the default from above.

The pragma/metadata can then further modify this per loop.

As a side-effect, we can now also use -enable-loop-distribute=0 from opt
to emulate the default from the optimization pipeline. So to be precise
this is the new behavior:

C1. pass is off by default from the optimization pipeline
unless -enable-loop-distribute or the pragma/metadata enables it

C2. pass is on when invoked directly from opt
unless -enable-loop-distribute=0 or the pragma/metadata disables it
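As a sketch, the C1/C2 defaults and overrides could be exercised from the command line like this (the legacy pass name -loop-distribute and the file name test.ll are illustrative placeholders; the flag spellings are the ones discussed above):

```shell
# C2: running the pass directly from opt, it is on by default
opt -loop-distribute -S test.ll

# C2 override: emulate the optimization-pipeline default (off) from opt
opt -loop-distribute -enable-loop-distribute=0 -S test.ll

# C1 override: turn the pass on inside the full pipeline
opt -O3 -enable-loop-distribute -S test.ll
```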

Diff Detail

Repository
rL LLVM

Event Timeline

anemet updated this revision to Diff 54702.Apr 22 2016, 12:24 PM
anemet retitled this revision from to [LoopDist] Add llvm.loop.distribute.enable loop metadata.
anemet updated this object.
anemet added a reviewer: hfinkel.
anemet added a subscriber: llvm-commits.
hfinkel edited edge metadata.Apr 26 2016, 9:53 AM

I have no problem with adding metadata to control this, just as we do with the vectorizer. However:

  1. What is preventing us from enabling this pass by default?
  2. Do we actually need to key it this way? Given that this metadata does not force anything, and is intended to enable vectorization, why not just key this off of !{!"llvm.loop.vectorize.enable", i1 1}? That way, should the user explicitly request vectorization, and we detect that this won't be possible, we try extra hard to provide it. That is, is there a use case where a user might want #pragma clang loop distribute(enable) without also adding vectorize(enable)?

> I have no problem with adding metadata to control this, just as we do with the vectorizer. However:
>
>   1. What is preventing us from enabling this pass by default?

The big piece is profitability. We need to prove that no ILP or MLP is impeded by distribution. Without this we can actually regress performance even if we vectorize some of the resulting loops (in my DevMeeting talk, the hmmer loop in SPECint without LoopLoadElimination is an example of the ILP case). I have some ideas for MLP (essentially, proving that the loop is HW-prefetcher friendly, so that misses occur the same way before and after distribution).

ILP is trickier because I would have to estimate the length of the critical path and check that we still have enough independent instructions in the resulting loops to avoid exposing the critical path by much. This may again be easier for the simple cases, but for hmmer, for example, it needs to be fairly accurate. Obviously we can start with a simple model initially.

We also need to ensure that the loop without the unsafe deps will be vectorized in the end, i.e., factor out the legality and profitability checks from the vectorizer. (There are some other efforts in this area, so this may happen independently.)

And of course, independently of all this, the user may still want to force distribution because it's known to be profitable. So even if the pass is on by default, this feature is still necessary in some form.

>   2. Do we actually need to key it this way? Given that this metadata does not force anything, and is intended to enable vectorization, why not just key this off of !{!"llvm.loop.vectorize.enable", i1 1}? That way, should the user explicitly request vectorization, and we detect that this won't be possible, we try extra hard to provide it. That is, is there a use case where a user might want #pragma clang loop distribute(enable) without also adding vectorize(enable)?

That's an interesting idea, but I think in the end we also want to give the user full control and knowledge of what transformations are taking place. Imagine we had many transformations that could turn a loop from non-vectorizable into vectorizable (e.g. peeling, data-layout modifications, etc.); the user may want to choose the particular transformation rather than just "vectorize at any cost".

Also, as I mentioned under your first point, we don't actually know whether we are going to vectorize the resulting loop, so tying this to vectorize.enable can be misleading. At this point I am more comfortable going with the more low-level control for distribution. What do you think?

>>   1. What is preventing us from enabling this pass by default?
>
> The big piece is profitability. We need to prove that no ILP or MLP is impeded by distribution. Without this we can actually regress performance even if we vectorize some of the resulting loops (in my DevMeeting talk, the hmmer loop in SPECint without LoopLoadElimination is an example of the ILP case). I have some ideas for MLP (essentially, proving that the loop is HW-prefetcher friendly, so that misses occur the same way before and after distribution).

Can you describe this in greater detail? There are definitely cases where we want to split loops based on available hardware prefetcher resource exhaustion, but I suspect that's a separate matter. Are you concerned here with finding likely high-latency loads and making sure we can hide them (to the extent possible) with other work?

> ILP is trickier because I would have to estimate the length of the critical path and check that we still have enough independent instructions in the resulting loops to avoid exposing the critical path by much. This may again be easier for the simple cases, but for hmmer, for example, it needs to be fairly accurate. Obviously we can start with a simple model initially.

This all makes sense. I suspect that we should really be doing the same thing for the vectorizer's interleaving.

> We also need to ensure that the loop without the unsafe deps will be vectorized in the end, i.e., factor out the legality and profitability checks from the vectorizer. (There are some other efforts in this area, so this may happen independently.)

Makes sense.

> And of course, independently of all this, the user may still want to force distribution because it's known to be profitable. So even if the pass is on by default, this feature is still necessary in some form.

I agree.

>>   2. Do we actually need to key it this way? Given that this metadata does not force anything, and is intended to enable vectorization, why not just key this off of !{!"llvm.loop.vectorize.enable", i1 1}? That way, should the user explicitly request vectorization, and we detect that this won't be possible, we try extra hard to provide it. That is, is there a use case where a user might want #pragma clang loop distribute(enable) without also adding vectorize(enable)?
>
> That's an interesting idea, but I think in the end we also want to give the user full control and knowledge of what transformations are taking place. Imagine we had many transformations that could turn a loop from non-vectorizable into vectorizable (e.g. peeling, data-layout modifications, etc.); the user may want to choose the particular transformation rather than just "vectorize at any cost".
>
> Also, as I mentioned under your first point, we don't actually know whether we are going to vectorize the resulting loop, so tying this to vectorize.enable can be misleading. At this point I am more comfortable going with the more low-level control for distribution. What do you think?

For the vectorizer, we have a much higher limit on runtime checks when the user explicitly requests vectorization. Should we do the same here? [this is the only question here actually pertinent to this change; the other discussion can be moved elsewhere as desired].

>> I have some ideas for MLP (essentially, proving that the loop is HW-prefetcher friendly, so that misses occur the same way before and after distribution).
>
> Can you describe this in greater detail? There are definitely cases where we want to split loops based on available hardware prefetcher resource exhaustion, but I suspect that's a separate matter. Are you concerned here with finding likely high-latency loads and making sure we can hide them (to the extent possible) with other work?

Yes, that, or overlapping misses from the different loads. If you split the missing loads into separate loops, they are all exposed, whereas originally some were executed in the shadow of other misses.

Are you also asking about the ILP case? I can quickly describe it here if you haven't seen the slides or the talk.
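To make the overlapping-miss point concrete, a toy sketch (hypothetical code; both functions compute the same sum and differ only in when the misses on p[] and q[] are issued):

```c
/* Hypothetical sketch of the MLP concern.  Assume p[] and q[] are two
 * large arrays that miss in cache. */
long fused(const int *p, const int *q, int n) {
  long s = 0;
  for (int i = 0; i < n; ++i)
    s += p[i] + q[i]; /* misses on p[i] and q[i] can overlap in flight */
  return s;
}

long distributed(const int *p, const int *q, int n) {
  long s = 0;
  for (int i = 0; i < n; ++i)
    s += p[i]; /* first miss stream, fully exposed */
  for (int i = 0; i < n; ++i)
    s += q[i]; /* second miss stream, no longer hidden behind the first */
  return s;
}
```

In the fused loop each iteration's two misses can be serviced in parallel; after splitting, each loop exposes its miss stream on its own, which is the kind of regression a profitability model would have to rule out.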

> For the vectorizer, we have a much higher limit on runtime checks when the user explicitly requests vectorization. Should we do the same here?

Currently there is no limit on the number of memchecks we emit for loop distribution. That said, I certainly agree that when we add the threshold, we should drastically increase it with the pragma, just like we do for the vectorizer.

> Currently there is no limit on the number of memchecks we emit for loop distribution. That said, I certainly agree that when we add the threshold, we should drastically increase it with the pragma, just like we do for the vectorizer.

Isn't DistributeSCEVCheckThreshold just such a limit?

> Isn't DistributeSCEVCheckThreshold just such a limit?

Ah, good point; I was overly focused on memchecks ;). Let me update the patch.

Are you OK with the overall reasoning that we should have a dedicated loop-distribution pragma/metadata?

Thanks for the review.

Adam

> Are you OK with the overall reasoning that we should have a dedicated loop-distribution pragma/metadata?

Yes, I think this makes sense.


anemet updated this revision to Diff 55144.Apr 26 2016, 5:48 PM
anemet edited edge metadata.
  • Added a new SCEVCheckThreshold for the pragma-enable case
  • Rebased on top of rL267643, which makes it easier to cache whether distribution was forced in the new per-loop class

hfinkel accepted this revision.Apr 26 2016, 6:14 PM
hfinkel edited edge metadata.

LGTM.

docs/LangRef.rst
4709 ↗(On Diff #55144)

You should probably add a sentence or two here, at least, explaining what "loop distribution" means in this context.

This revision is now accepted and ready to land.Apr 26 2016, 6:14 PM
anemet added inline comments.Apr 26 2016, 6:19 PM
docs/LangRef.rst
4709 ↗(On Diff #55144)

Thanks, will do.

This revision was automatically updated to reflect the committed changes.