This is an archive of the discontinued LLVM Phabricator instance.

[SLPVectorizer] Account for dependence cycles to fix PR25108
Needs Review · Public

Authored by Ayal on Mar 30 2016, 4:24 PM.


This revision needs review, but there are no reviewers specified.

Details

Reviewers: None

Summary

This patch fixes PR25108 but causes a few test cases to "fail" (see below), and it has yet to help a real workload enough to justify committing it. It is posted in the hope that it may become useful.

The patch identifies scalar instructions that lie on data dependence cycles and boosts the cost of SLP tree entries that contain such scalar instructions, thereby preventing those entries from being vectorized. A more precise solution would check how much of the latency across such dependence cycles is hidden by ILP, considering the VF.
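The cost adjustment described above can be sketched in isolation. This is only an illustration under stated assumptions: `Bundle` and `treeCost` are hypothetical names, scalars are plain integer ids, and the "boost" is modeled as raising a negative (profitable) bundle cost to zero, mirroring the `C = 0` clamp in the posted `getTreeCost` change.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Hypothetical stand-in for an SLP tree entry: the scalars bundled into one
// would-be vector instruction, plus the cost assigned by the cost model
// (negative means vectorizing is estimated to be profitable).
struct Bundle {
  std::vector<int> Scalars; // illustrative scalar instruction ids
  int Cost;
};

// Sum the bundle costs, but never let a bundle that touches a dependence
// cycle contribute a saving: its negative cost is raised to zero.
int treeCost(const std::vector<Bundle> &Tree, const std::set<int> &OnCycle) {
  int Total = 0;
  for (const Bundle &B : Tree) {
    assert(!B.Scalars.empty() && "a bundle holds at least one scalar");
    int C = B.Cost;
    if (C < 0) {
      for (int S : B.Scalars) {
        if (OnCycle.count(S)) {
          C = 0; // drop the estimated saving for this bundle
          break;
        }
      }
    }
    Total += C;
  }
  return Total;
}
```

With this shape, a tree whose profit comes mostly from cyclic bundles loses that profit and is likely to fall back to scalar code, while bundles with non-negative cost are left untouched.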

The patch contains 3 parts, which may be of independent interest:

Resurrecting DataFlow.h, which was deleted by Chandler two years ago for lack of use (r202825). It is employed here to provide a use-def dependence graph over Values. In doing so we ignore memory dependences, which do not appear in this PR but may require attention in the future. The appropriate location is IR/DataFlow.h rather than under Support; this patch simply resurrects the file, to be moved in a separate commit if desired.

Enhancing scc_iterator in order to iteratively compute strongly-connected components in the data-dependence graph, starting from the scalar instructions at the root of the SLP tree (e.g., a vector of stores). The current scc_iterator works from a single entry node; if multiple entry nodes are to be scanned, an artificial entry node can be used as done in FunctionAttrs.cpp/SyntheticRoot. In our case we want to repeatedly start from a subset of entry nodes, record the nodes found to lie on cycles, and refrain from scanning parts of the graph multiple times. This is accomplished by providing scc_iterator with an optional argument holding nodes that have been visited already and should not be revisited. This extension, which follows Tarjan’s original algorithm, can be used to simplify FunctionAttrs.cpp; doing so deserves a separate patch.

The SLP vectorizer adjusts the cost of each would-be vector instruction that contains a cyclic scalar instruction. We try to save compile time by computing SCCs only for SLP trees that are about to be vectorized based on their original total cost, and by recalculating this cost on demand.
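The idea of scanning from several roots while sharing the set of already-visited nodes can be sketched independently of LLVM. The following is a minimal illustration, not the patch's code: `Graph`, `CycleFinder`, and `scanFrom` are hypothetical names, nodes are integer ids, and the traversal is a recursive Tarjan DFS for brevity (LLVM's scc_iterator uses an explicit stack). Nodes finished during an earlier scan are skipped, and only SCCs with more than one node are recorded as cyclic, as in the posted `markCyclicInstructions`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical dependence graph: DepGraph[v] lists the nodes v depends on
// (its operands), mirroring a use-def walk.
using Graph = std::vector<std::vector<int>>;

class CycleFinder {
  const Graph &DepGraph;
  std::vector<int> Index, Low; // 0 means "not visited yet"
  std::vector<bool> OnStack;
  std::vector<int> Stack;
  int Next = 0;

  void visit(int V) {
    Index[V] = Low[V] = ++Next;
    Stack.push_back(V);
    OnStack[V] = true;
    for (int W : DepGraph[V]) {
      if (Index[W] == 0) {
        visit(W);
        Low[V] = std::min(Low[V], Low[W]);
      } else if (OnStack[W]) {
        Low[V] = std::min(Low[V], Index[W]);
      }
      // Otherwise W was finished earlier, possibly from a previous root,
      // so it is not rescanned.
    }
    if (Low[V] != Index[V])
      return;
    // V is the root of an SCC: pop its members off the stack.
    std::vector<int> SCC;
    int W;
    do {
      W = Stack.back();
      Stack.pop_back();
      OnStack[W] = false;
      SCC.push_back(W);
    } while (W != V);
    if (SCC.size() > 1) // single-node SCCs are not treated as cycles here
      OnCycle.insert(SCC.begin(), SCC.end());
  }

public:
  std::set<int> OnCycle; // nodes found on some multi-node cycle so far

  explicit CycleFinder(const Graph &G)
      : DepGraph(G), Index(G.size(), 0), Low(G.size(), 0),
        OnStack(G.size(), false) {}

  // Scan from one root; nodes reached by earlier calls are not revisited.
  void scanFrom(int Root) {
    assert(Root >= 0 && static_cast<std::size_t>(Root) < DepGraph.size());
    if (Index[Root] == 0)
      visit(Root);
  }
};
```

Skipping finished nodes is safe because everything reachable from a finished node has itself been finished, so no new cycle can pass through it. Keeping the visit state in one object across calls plays the role of the optional visited-set argument described above.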

The patch causes 3 tests to fail because they no longer get vectorized: Transforms/SLPVectorizer/X86/{cycle_dup.ll, external_user.ll, phi.ll}. Testing the first reveals that it is indeed 15% faster with -fno-slp-vectorize. To keep the cost model from interfering with these tests, they should include -force-vector-width=4, but that only applies to the loop vectorizer. Forcing the SLP vectorizer may deserve separate attention.

Diff Detail

Event Timeline

Ayal updated this revision to Diff 52147. Mar 30 2016, 4:24 PM

Ayal retitled this revision to [SLPVectorizer] Account for dependence cycles to fix PR25108.

Ayal updated this object.

Ayal added reviewers: dorit, gilr.

Herald added a subscriber: mzolotukhin. Mar 30 2016, 4:24 PM

Ayal updated this revision to Diff 62633. Jul 3 2016, 12:29 PM

Ayal updated this object.

Ayal removed reviewers: gilr, dorit.

Ayal added subscribers: mkuper, nadav, anemet and 5 others.

Ayal, I commented on the bug report that I don't understand why this heuristic is useful in the general case (for loops other than this specific loop). Are you seeing any speedups on SPEC or the LLVM test suite (or other test suites)?

zinovy.nis added a subscriber: zinovy.nis. Jul 7 2016, 2:40 AM

Revision Contents

Path                                          Size
include/llvm/ADT/SCCIterator.h                35 lines
include/llvm/Support/DataFlow.h               104 lines
lib/Transforms/Vectorize/SLPVectorizer.cpp    40 lines

Diff 52147

include/llvm/ADT/SCCIterator.h

(20 lines not shown)
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

#ifndef LLVM_ADT_SCCITERATOR_H		#ifndef LLVM_ADT_SCCITERATOR_H
#define LLVM_ADT_SCCITERATOR_H		#define LLVM_ADT_SCCITERATOR_H

#include "llvm/ADT/DenseMap.h"		#include "llvm/ADT/DenseMap.h"
#include "llvm/ADT/GraphTraits.h"		#include "llvm/ADT/GraphTraits.h"
#include "llvm/ADT/iterator.h"		#include "llvm/ADT/iterator.h"
		#include <set>
#include <vector>		#include <vector>

namespace llvm {		namespace llvm {

/// \brief Enumerate the SCCs of a directed graph in reverse topological order		/// \brief Enumerate the SCCs of a directed graph in reverse topological order
/// of the SCC DAG.		/// of the SCC DAG.
///		///
/// This is implemented using Tarjan's DFS algorithm using an internal stack to		/// This is implemented using Tarjan's DFS algorithm using an internal stack to
(37 lines not shown)	class scc_iterator

/// The current SCC, retrieved using operator*().		/// The current SCC, retrieved using operator*().
SccTy CurrentSCC;		SccTy CurrentSCC;

/// DFS stack, Used to maintain the ordering. The top contains the current		/// DFS stack, Used to maintain the ordering. The top contains the current
/// node, the next child to visit, and the minimum uplink value of all child		/// node, the next child to visit, and the minimum uplink value of all child
std::vector<StackElement> VisitStack;		std::vector<StackElement> VisitStack;

		std::set<NodeType *> *DontRevisitNodes;

/// A single "visit" within the non-recursive DFS traversal.		/// A single "visit" within the non-recursive DFS traversal.
void DFSVisitOne(NodeType *N);		void DFSVisitOne(NodeType *N);

/// The stack-based DFS traversal; defined below.		/// The stack-based DFS traversal; defined below.
void DFSVisitChildren();		void DFSVisitChildren();

/// Compute the next SCC using the DFS traversal.		/// Compute the next SCC using the DFS traversal.
void GetNextSCC();		void GetNextSCC();

scc_iterator(NodeType *entryN) : visitNum(0) {		scc_iterator(NodeType *entryN, std::set<NodeType *> *DRN = nullptr) :
		visitNum(0), DontRevisitNodes(DRN) {
DFSVisitOne(entryN);		DFSVisitOne(entryN);
GetNextSCC();		GetNextSCC();
}		}

/// End is when the DFS stack is empty.		/// End is when the DFS stack is empty.
scc_iterator() {}		scc_iterator() {}

public:		public:
static scc_iterator begin(const GraphT &G) {		static scc_iterator begin(const GraphT &G,
return scc_iterator(GT::getEntryNode(G));		std::set<NodeType *> *DontRevisitNodes = nullptr) {
		return scc_iterator(GT::getEntryNode(G), DontRevisitNodes);
}		}
static scc_iterator end(const GraphT &) { return scc_iterator(); }		static scc_iterator end(const GraphT &) { return scc_iterator(); }

/// \brief Direct loop termination test which is more efficient than		/// \brief Direct loop termination test which is more efficient than
/// comparison with \c end().		/// comparison with \c end().
bool isAtEnd() const {		bool isAtEnd() const {
assert(!CurrentSCC.empty() || VisitStack.empty());		assert(!CurrentSCC.empty() || VisitStack.empty());
return CurrentSCC.empty();		return CurrentSCC.empty();
(20 lines not shown)	public:
bool hasLoop() const;		bool hasLoop() const;

/// This informs the \c scc_iterator that the specified \c Old node		/// This informs the \c scc_iterator that the specified \c Old node
/// has been deleted, and \c New is to be used in its place.		/// has been deleted, and \c New is to be used in its place.
void ReplaceNode(NodeType *Old, NodeType *New) {		void ReplaceNode(NodeType *Old, NodeType *New) {
assert(nodeVisitNumbers.count(Old) && "Old not in scc_iterator?");		assert(nodeVisitNumbers.count(Old) && "Old not in scc_iterator?");
nodeVisitNumbers[New] = nodeVisitNumbers[Old];		nodeVisitNumbers[New] = nodeVisitNumbers[Old];
nodeVisitNumbers.erase(Old);		nodeVisitNumbers.erase(Old);
		if (DontRevisitNodes && DontRevisitNodes->count(Old)) {
		DontRevisitNodes->erase(Old);
		DontRevisitNodes->insert(New);
		}
}		}
};		};

template <class GraphT, class GT>		template <class GraphT, class GT>
void scc_iterator<GraphT, GT>::DFSVisitOne(NodeType *N) {		void scc_iterator<GraphT, GT>::DFSVisitOne(NodeType *N) {
		if (DontRevisitNodes && DontRevisitNodes->count(N))
		return;
		if (DontRevisitNodes)
		DontRevisitNodes->insert(N);
++visitNum;		++visitNum;
nodeVisitNumbers[N] = visitNum;		nodeVisitNumbers[N] = visitNum;
SCCNodeStack.push_back(N);		SCCNodeStack.push_back(N);
VisitStack.push_back(StackElement(N, GT::child_begin(N), visitNum));		VisitStack.push_back(StackElement(N, GT::child_begin(N), visitNum));
#if 0 // Enable if needed when debugging.		#if 0 // Enable if needed when debugging.
dbgs() << "TarjanSCC: Node " << N <<		dbgs() << "TarjanSCC: Node " << N <<
" : visitNum = " << visitNum << "\n";		" : visitNum = " << visitNum << "\n";
#endif		#endif
(65 lines not shown)	bool scc_iterator<GraphT, GT>::hasLoop() const {
for (ChildItTy CI = GT::child_begin(N), CE = GT::child_end(N); CI != CE;		for (ChildItTy CI = GT::child_begin(N), CE = GT::child_end(N); CI != CE;
++CI)		++CI)
if (*CI == N)		if (*CI == N)
return true;		return true;
return false;		return false;
}		}

/// \brief Construct the begin iterator for a deduced graph type T.		/// \brief Construct the begin iterator for a deduced graph type T.
template <class T> scc_iterator<T> scc_begin(const T &G) {		// template <class T, class GT = GraphTraits<T>>
return scc_iterator<T>::begin(G);		template <class T>
		scc_iterator<T> scc_begin(
		const T &G,
		std::set<typename GraphTraits<T>::NodeType *> *DontRevisitNodes = nullptr) {
		return scc_iterator<T>::begin(G, DontRevisitNodes);
}		}

/// \brief Construct the end iterator for a deduced graph type T.		/// \brief Construct the end iterator for a deduced graph type T.
template <class T> scc_iterator<T> scc_end(const T &G) {		template <class T> scc_iterator<T> scc_end(const T &G) {
return scc_iterator<T>::end(G);		return scc_iterator<T>::end(G);
}		}

/// \brief Construct the begin iterator for a deduced graph type T's Inverse<T>.		/// \brief Construct the begin iterator for a deduced graph type T's Inverse<T>.
template <class T> scc_iterator<Inverse<T> > scc_begin(const Inverse<T> &G) {		template <class T>
return scc_iterator<Inverse<T> >::begin(G);		scc_iterator<Inverse<T>> scc_begin(
		const Inverse<T> &G,
		std::set<typename GraphTraits<Inverse<T>>::NodeType *> *DontRevisitNodes =
		nullptr) {
		return scc_iterator<Inverse<T> >::begin(G, DontRevisitNodes);
}		}

/// \brief Construct the end iterator for a deduced graph type T's Inverse<T>.		/// \brief Construct the end iterator for a deduced graph type T's Inverse<T>.
template <class T> scc_iterator<Inverse<T> > scc_end(const Inverse<T> &G) {		template <class T> scc_iterator<Inverse<T> > scc_end(const Inverse<T> &G) {
return scc_iterator<Inverse<T> >::end(G);		return scc_iterator<Inverse<T> >::end(G);
}		}

} // End llvm namespace		} // End llvm namespace

#endif		#endif

include/llvm/Support/DataFlow.h

				//===-- llvm/Support/DataFlow.h - dataflow as graphs ------------*- C++ -*-===//
				//
				// The LLVM Compiler Infrastructure
				//
				// This file is distributed under the University of Illinois Open Source
				// License. See LICENSE.TXT for details.
				//
				//===----------------------------------------------------------------------===//
				//
				// This file defines specializations of GraphTraits that allows Use-Def and
				// Def-Use relations to be treated as proper graphs for generic algorithms.
				//===----------------------------------------------------------------------===//

				#ifndef LLVM_SUPPORT_DATAFLOW_H
				#define LLVM_SUPPORT_DATAFLOW_H

				#include "llvm/IR/User.h"
				#include "llvm/IR/Value.h"
				#include "llvm/ADT/GraphTraits.h"

				namespace llvm {

				//===----------------------------------------------------------------------===//
				// Provide specializations of GraphTraits to be able to treat def-use/use-def
				// chains as graphs

				template <> struct GraphTraits<const Value*> {
				typedef const Value NodeType;
				typedef Value::const_user_iterator ChildIteratorType;

				static NodeType *getEntryNode(const Value *G) {
				return G;
				}

				static inline ChildIteratorType child_begin(NodeType *N) {
				return N->user_begin();
				}

				static inline ChildIteratorType child_end(NodeType *N) {
				return N->user_end();
				}
				};

				template <> struct GraphTraits<Value*> {
				typedef Value NodeType;
				typedef Value::user_iterator ChildIteratorType;

				static NodeType *getEntryNode(Value *G) {
				return G;
				}

				static inline ChildIteratorType child_begin(NodeType *N) {
				return N->user_begin();
				}

				static inline ChildIteratorType child_end(NodeType *N) {
				return N->user_end();
				}
				};

				template <> struct GraphTraits<Inverse<const User*> > {
				typedef const Value NodeType;
				typedef User::const_op_iterator ChildIteratorType;

				static NodeType *getEntryNode(Inverse<const User*> G) {
				return G.Graph;
				}

				static inline ChildIteratorType child_begin(NodeType *N) {
				if (const User *U = dyn_cast<User>(N))
				return U->op_begin();
				return NULL;
				}

				static inline ChildIteratorType child_end(NodeType *N) {
				if(const User *U = dyn_cast<User>(N))
				return U->op_end();
				return NULL;
				}
				};

				template <> struct GraphTraits<Inverse<User*> > {
				typedef Value NodeType;
				typedef User::op_iterator ChildIteratorType;

				static NodeType *getEntryNode(Inverse<User*> G) {
				return G.Graph;
				}

				static inline ChildIteratorType child_begin(NodeType *N) {
				if (User *U = dyn_cast<User>(N))
				return U->op_begin();
				return NULL;
				}

				static inline ChildIteratorType child_end(NodeType *N) {
				if (User *U = dyn_cast<User>(N))
				return U->op_end();
				return NULL;
				}
				};

				}
				#endif

lib/Transforms/Vectorize/SLPVectorizer.cpp

(12 lines not shown)
//		//
// The pass is inspired by the work described in the paper:		// The pass is inspired by the work described in the paper:
// "Loop-Aware SLP in GCC" by Ira Rosen, Dorit Nuzman, Ayal Zaks.		// "Loop-Aware SLP in GCC" by Ira Rosen, Dorit Nuzman, Ayal Zaks.
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
#include "llvm/ADT/MapVector.h"		#include "llvm/ADT/MapVector.h"
#include "llvm/ADT/Optional.h"		#include "llvm/ADT/Optional.h"
#include "llvm/ADT/PostOrderIterator.h"		#include "llvm/ADT/PostOrderIterator.h"
		#include "llvm/ADT/SCCIterator.h"
#include "llvm/ADT/SetVector.h"		#include "llvm/ADT/SetVector.h"
#include "llvm/ADT/Statistic.h"		#include "llvm/ADT/Statistic.h"
#include "llvm/Analysis/AliasAnalysis.h"		#include "llvm/Analysis/AliasAnalysis.h"
#include "llvm/Analysis/AssumptionCache.h"		#include "llvm/Analysis/AssumptionCache.h"
#include "llvm/Analysis/CodeMetrics.h"		#include "llvm/Analysis/CodeMetrics.h"
#include "llvm/Analysis/DemandedBits.h"		#include "llvm/Analysis/DemandedBits.h"
#include "llvm/Analysis/GlobalsModRef.h"		#include "llvm/Analysis/GlobalsModRef.h"
#include "llvm/Analysis/LoopAccessAnalysis.h"		#include "llvm/Analysis/LoopAccessAnalysis.h"
(11 lines not shown)
#include "llvm/IR/IntrinsicInst.h"		#include "llvm/IR/IntrinsicInst.h"
#include "llvm/IR/Module.h"		#include "llvm/IR/Module.h"
#include "llvm/IR/NoFolder.h"		#include "llvm/IR/NoFolder.h"
#include "llvm/IR/Type.h"		#include "llvm/IR/Type.h"
#include "llvm/IR/Value.h"		#include "llvm/IR/Value.h"
#include "llvm/IR/Verifier.h"		#include "llvm/IR/Verifier.h"
#include "llvm/Pass.h"		#include "llvm/Pass.h"
#include "llvm/Support/CommandLine.h"		#include "llvm/Support/CommandLine.h"
		#include "llvm/Support/DataFlow.h"
#include "llvm/Support/Debug.h"		#include "llvm/Support/Debug.h"
#include "llvm/Support/raw_ostream.h"		#include "llvm/Support/raw_ostream.h"
#include "llvm/Transforms/Vectorize.h"		#include "llvm/Transforms/Vectorize.h"
#include <algorithm>		#include <algorithm>
#include <map>		#include <map>
#include <memory>		#include <memory>

using namespace llvm;		using namespace llvm;
(372 lines not shown)	public:
void computeMinimumValueSizes();		void computeMinimumValueSizes();

private:		private:
struct TreeEntry;		struct TreeEntry;

/// \returns the cost of the vectorizable entry.		/// \returns the cost of the vectorizable entry.
int getEntryCost(TreeEntry *E);		int getEntryCost(TreeEntry *E);

		/// Scan use-def dependence graph to find and mark all scalar instructions
		/// that belong to the tree and lie on dependence cycles.
		void markCyclicInstructions();

/// This is the recursive part of buildTree.		/// This is the recursive part of buildTree.
void buildTree_rec(ArrayRef<Value *> Roots, unsigned Depth);		void buildTree_rec(ArrayRef<Value *> Roots, unsigned Depth);

/// Vectorize a single entry in the tree.		/// Vectorize a single entry in the tree.
Value *vectorizeTree(TreeEntry *E);		Value *vectorizeTree(TreeEntry *E);

/// Vectorize a single entry in the tree, starting in \p VL.		/// Vectorize a single entry in the tree, starting in \p VL.
Value *vectorizeTree(ArrayRef<Value *> VL);		Value *vectorizeTree(ArrayRef<Value *> VL);
(75 lines not shown)	private:
std::vector<TreeEntry> VectorizableTree;		std::vector<TreeEntry> VectorizableTree;

/// Maps a specific scalar to its tree entry.		/// Maps a specific scalar to its tree entry.
SmallDenseMap<Value*, int> ScalarToTreeEntry;		SmallDenseMap<Value*, int> ScalarToTreeEntry;

/// A list of scalars that we found that we need to keep as scalars.		/// A list of scalars that we found that we need to keep as scalars.
ValueSet MustGather;		ValueSet MustGather;

		/// A list of scalars that we found lying on a dependence cycle.
		ValueSet OnDependenceCycle;

		/// A list of scalars that we already visited looking for cycles.
		std::set<Value *> VisitedInstructions;

/// This POD struct describes one external user in the vectorized tree.		/// This POD struct describes one external user in the vectorized tree.
struct ExternalUser {		struct ExternalUser {
ExternalUser (Value *S, llvm::User *U, int L) :		ExternalUser (Value *S, llvm::User *U, int L) :
Scalar(S), User(U), Lane(L){}		Scalar(S), User(U), Lane(L){}
// Which scalar in our function.		// Which scalar in our function.
Value *Scalar;		Value *Scalar;
// Which user that uses the scalar.		// Which user that uses the scalar.
llvm::User *User;		llvm::User *User;
(460 lines not shown)	for (int Lane = 0, LE = Entry->Scalars.size(); Lane != LE; ++Lane) {
Lane << " from " << *Scalar << ".\n");		Lane << " from " << *Scalar << ".\n");
ExternalUses.push_back(ExternalUser(Scalar, U, Lane));		ExternalUses.push_back(ExternalUser(Scalar, U, Lane));
}		}
}		}
}		}
}		}


		void BoUpSLP::markCyclicInstructions() {
		DEBUG(dbgs() << "SLP: Marking cyclic instructions.\n");
		ArrayRef<Value*> VL = VectorizableTree[0].Scalars;
		for (unsigned i = 0, e = VL.size(); i != e; ++i) {
		Inverse<User *> VLi = cast<User>(VL[i]);
		DEBUG(dbgs() << "SLP: checking instruction (" << *VL[i] << ").\n");
		for (scc_iterator<Inverse<User *>> I = scc_begin(VLi, &VisitedInstructions);
		!I.isAtEnd(); ++I) {
		const std::vector<Value *> &ValueSCC = *I;
		for (auto Val : ValueSCC)
		DEBUG(dbgs() << "SLP: in scc (" << *Val << ").\n");
		if (ValueSCC.size() > 1)
		OnDependenceCycle.insert(ValueSCC.begin(), ValueSCC.end());
		}
		}
		}


void BoUpSLP::buildTree_rec(ArrayRef<Value *> VL, unsigned Depth) {		void BoUpSLP::buildTree_rec(ArrayRef<Value *> VL, unsigned Depth) {
bool SameTy = getSameType(VL); (void)SameTy;		bool SameTy = getSameType(VL); (void)SameTy;
bool isAltShuffle = false;		bool isAltShuffle = false;
assert(SameTy && "Invalid types!");		assert(SameTy && "Invalid types!");

if (Depth == RecursionMaxDepth) {		if (Depth == RecursionMaxDepth) {
DEBUG(dbgs() << "SLP: Gathering due to max recursion depth.\n");		DEBUG(dbgs() << "SLP: Gathering due to max recursion depth.\n");
newTreeEntry(VL, false);		newTreeEntry(VL, false);
(781 lines not shown)	int BoUpSLP::getTreeCost() {
// We only vectorize tiny trees if it is fully vectorizable.		// We only vectorize tiny trees if it is fully vectorizable.
if (VectorizableTree.size() < 3 && !isFullyVectorizableTinyTree()) {		if (VectorizableTree.size() < 3 && !isFullyVectorizableTinyTree()) {
if (VectorizableTree.empty()) {		if (VectorizableTree.empty()) {
assert(!ExternalUses.size() && "We should not have any external users");		assert(!ExternalUses.size() && "We should not have any external users");
}		}
return INT_MAX;		return INT_MAX;
}		}

		markCyclicInstructions();

unsigned BundleWidth = VectorizableTree[0].Scalars.size();		unsigned BundleWidth = VectorizableTree[0].Scalars.size();

for (TreeEntry &TE : VectorizableTree) {		for (TreeEntry &TE : VectorizableTree) {
int C = getEntryCost(&TE);		int C = getEntryCost(&TE);
		if (C < 0 && !OnDependenceCycle.empty()) {
		for (int Lane = 0, LE = TE.Scalars.size(); Lane != LE; ++Lane)
		if (OnDependenceCycle.count(TE.Scalars[Lane])) {
		DEBUG(dbgs() << "SLP: Instruction found to lie on cycle.\n");
		C = 0;
		break;
		}
		}
DEBUG(dbgs() << "SLP: Adding cost " << C << " for bundle that starts with "		DEBUG(dbgs() << "SLP: Adding cost " << C << " for bundle that starts with "
<< *TE.Scalars[0] << ".\n");		<< *TE.Scalars[0] << ".\n");
Cost += C;		Cost += C;
}		}

SmallSet<Value *, 16> ExtractCostCalculated;		SmallSet<Value *, 16> ExtractCostCalculated;
int ExtractCost = 0;		int ExtractCost = 0;
for (ExternalUser &EU : ExternalUses) {		for (ExternalUser &EU : ExternalUses) {
(2,762 lines not shown)