This is an archive of the discontinued LLVM Phabricator instance.

[LoopVectorize] Shrink integer operations into the smallest type possible
Closed, Public

Authored by jmolloy on Oct 8 2015, 5:58 AM.


Details

Reviewers

anemet
gberry
mssimpso
hfinkel
sbaranga

Summary

C semantics force sub-int-sized values (e.g. i8, i16) to be promoted to int
type (e.g. i32) whenever arithmetic is performed on them.
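The promotion rule can be seen directly in C++, which inherits it from C. The following is a minimal sketch; the helper names `addBytes` and `addBytesNarrow` are made up for illustration and are not part of the patch.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// The language itself says the sum of two 8-bit values has type int.
static_assert(
    std::is_same<decltype(std::uint8_t{} + std::uint8_t{}), int>::value,
    "uint8_t + uint8_t is computed in int");

// Hypothetical helper: both operands are promoted to int before the add,
// so the result can exceed 255.
inline int addBytes(std::uint8_t A, std::uint8_t B) { return A + B; }

// Hypothetical helper: storing the sum back into a byte keeps only the low
// 8 bits. This extend/add/truncate shape is what the patch looks for.
inline std::uint8_t addBytesNarrow(std::uint8_t A, std::uint8_t B) {
  int Wide = addBytes(A, B);
  assert(Wide >= 0 && Wide <= 510 && "sum of two bytes always fits in int");
  return static_cast<std::uint8_t>(Wide);
}
```

When only the narrowed result is used, the upper 24 bits of the i32 addition are never observed, so the whole computation could be done in i8.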

For targets with native i8 or i16 operations, InstCombine can usually shrink
the arithmetic type down again. However, InstCombine refuses to create illegal
types, so for targets without i8 or i16 registers, the lengthening and
shrinking remain.

However, most SIMD ISAs (e.g. NEON) support vectors of i8 or i16 even when
the scalar equivalents are not supported, so during vectorization it is
important to remove these lengthens and truncates when deciding the
profitability of vectorization.

The algorithm starts at truncs and icmps and walks their use-def chains until
they terminate, reach instructions outside the loop, or hit unsafe
instructions such as inttoptr casts. If the use-def chains starting from
different root instructions (truncs/icmps) meet, they are unioned. The
demanded bits of each node in the graph are ORed together to form an overall
mask of the demanded bits in the entire graph. The minimum bitwidth that the
graph can be truncated to is the bitwidth minus the number of leading zeroes
in the overall mask.
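The final step of that computation can be sketched without any LLVM dependencies. This is only an illustration under the assumption that demanded-bit masks are held in 64-bit integers (as the patch does); the function names below are made up and stand in for LLVM's own bit-manipulation helpers.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Portable leading-zero count for a 64-bit value; returns 64 for zero.
inline unsigned countLeadingZeros64(std::uint64_t V) {
  unsigned N = 0;
  for (std::uint64_t Bit = 1ULL << 63; Bit != 0 && (V & Bit) == 0; Bit >>= 1)
    ++N;
  return N;
}

// OR the demanded-bit masks of every node in one graph, then subtract the
// number of leading zeroes of the combined mask from the mask width (64).
inline unsigned minBitWidth(const std::vector<std::uint64_t> &Masks) {
  std::uint64_t Overall = 0;
  for (std::uint64_t M : Masks)
    Overall |= M;
  unsigned Width = 64 - countLeadingZeros64(Overall);
  assert(Width <= 64);
  return Width;
}

// The patch then rounds the width up to a power of two, since vector
// element types such as i8 and i16 come in power-of-two sizes.
inline unsigned roundUpToPowerOf2(unsigned Width) {
  unsigned P = 1;
  while (P < Width)
    P <<= 1;
  return P;
}
```

For example, a graph whose nodes demand `0xFF` and `0x1FF` has a combined mask of `0x1FF`, needs 9 bits, and is therefore rounded up to 16.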

The intention is that this algorithm should "first do no harm", so it will
never insert extra cast instructions. This is why the use-def graphs are
unioned, so that subgraphs with different minimum bitwidths do not need casts
inserted between them.
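The unioning step can be illustrated with a small union-find structure that merges demanded-bit masks as sets are joined. This is a simplified sketch, not the patch's code: the class `DemandedBitsClasses` and the use of integer IDs for values are assumptions made for the example, standing in for `EquivalenceClasses<Value*>` and the `DBits` map.

```cpp
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical stand-in: each value is an index, and each set of connected
// values shares one ORed mask of demanded bits.
class DemandedBitsClasses {
  std::vector<unsigned> Parent;
  std::vector<std::uint64_t> Mask;

public:
  explicit DemandedBitsClasses(const std::vector<std::uint64_t> &Demanded)
      : Parent(Demanded.size()), Mask(Demanded) {
    std::iota(Parent.begin(), Parent.end(), 0u); // every value starts alone
  }

  // Find the leader of X's set, halving the path as we go.
  unsigned find(unsigned X) {
    assert(X < Parent.size());
    while (Parent[X] != X) {
      Parent[X] = Parent[Parent[X]];
      X = Parent[X];
    }
    return X;
  }

  // Join the sets of A and B; the surviving leader holds the ORed mask.
  void unite(unsigned A, unsigned B) {
    unsigned RA = find(A), RB = find(B);
    if (RA == RB)
      return;
    Parent[RB] = RA;
    Mask[RA] |= Mask[RB];
  }

  // Every member of a set gets the same width, so no casts are needed
  // between connected values.
  unsigned minBitWidth(unsigned X) {
    unsigned W = 0;
    for (std::uint64_t M = Mask[find(X)]; M != 0; M >>= 1)
      ++W;
    return W;
  }
};
```

If values 0 and 1 each demand 8 bits but are connected to value 2, which demands 16, all three end up at 16 bits, while an unconnected value keeps its own smaller width.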

This algorithm works hard to reduce compile-time impact. DemandedBits is only
queried if an extend of an illegal type and a truncate to an illegal type are
both seen. In the general case, this results in a simple linear scan of the
instructions in the loop.

No compile-time impact beyond noise was seen on a clang bootstrap build.

Diff Detail

Repository: rL LLVM

Event Timeline

jmolloy updated this revision to Diff 36848. Oct 8 2015, 5:58 AM

jmolloy retitled this revision from to [LoopVectorize] Shrink integer operations into the smallest type possible.

jmolloy updated this object.

jmolloy added reviewers: sbaranga, hfinkel, gberry, anemet.

jmolloy set the repository for this revision to rL LLVM.

jmolloy added a subscriber: llvm-commits.

Herald added a subscriber: aemerson. Oct 8 2015, 5:58 AM

mcrosier added a reviewer: mssimpso. Oct 8 2015, 6:03 AM

Hi James,

I generally really like the idea of using this DemandedBits analysis here. Otherwise, I have a few comments (inline).

Do you see any changes to lnt/spec performance with this patch?

Thanks,
Silviu

lib/Analysis/VectorUtils.cpp
441	Returning possibly large structs is not ideal. Would it be better for this function to take a reference to a map instead?
549	Should be sizeof(LeaderDemandedBits) instead of 64.
lib/Transforms/Vectorize/LoopVectorize.cpp
3146	Is there any way of doing this without leaving the ext/trunc pairs? Maybe instead of generating a trunc look if we're doing it for an ext and if so use the ext operand instead.
3797	Stray whitespace change?
5316–5318	I think you wouldn't get the correct cost here (for ICmp)?

Hi Silviu,

Thanks for the comments, I'll address them all tomorrow. I don't have LNT or spec numbers right now, but I can get them for you tomorrow.

Cheers,

James

lib/Analysis/VectorUtils.cpp
441	Ah, but we're in C++11 now! I did have to check on google that returning a class will invoke the move constructor... apparently so!
549	Agreed. I will make this change.
lib/Transforms/Vectorize/LoopVectorize.cpp
3146	Yeah, I suppose that would be neater than relying on InstCombine. I'll take a look at this.
3797	Woops, will fix.
5316–5318	Ouch, you're right. I'll fix that.
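
The point about returning a container by value (line 441) can be checked with a small experiment. This is a sketch only: `CountingMap` and `buildMap` are hypothetical names, and `std::map` stands in for `DenseMap`, which also has a move constructor.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>

// Hypothetical wrapper that counts how often it is copied.
struct CountingMap {
  static int Copies;
  std::map<int, std::uint64_t> Data;

  CountingMap() = default;
  CountingMap(const CountingMap &Other) : Data(Other.Data) { ++Copies; }
  CountingMap(CountingMap &&Other) noexcept : Data(std::move(Other.Data)) {}
};
int CountingMap::Copies = 0;

// Returning a local by value: since C++11 the local is treated as an rvalue
// in the return statement, so it is moved or the move is elided entirely.
CountingMap buildMap() {
  CountingMap M;
  M.Data[1] = 8;
  M.Data[2] = 16;
  assert(M.Data.size() == 2);
  return M;
}
```

Calling `buildMap()` leaves `CountingMap::Copies` at zero, so returning by value costs no element-wise copy here.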

Hi Silviu,

Thanks for the review. Updated diff attached.

Cheers,

James

Thanks, James! I only have one more comment.

-Silviu

lib/Transforms/Vectorize/LoopVectorize.cpp
3146	It looks like this solution will leave us some zext instructions with no users. I think we can clean these up as well.

Thanks Silviu - that's a simple enough change. Done.

LGTM!

Regarding performance numbers: as long as we don't have regressions I think it's ok (having regressions with this change would have been suspicious).

Thanks,
Silviu

This revision is now accepted and ready to land. Oct 12 2015, 4:17 AM

Thanks Silviu - r250032.

Revision Contents

Path    Size

include/llvm/Analysis/VectorUtils.h    42 lines
lib/Analysis/VectorUtils.cpp    130 lines
lib/Transforms/Vectorize/LoopVectorize.cpp    191 lines
test/Transforms/LoopVectorize/AArch64/loop-vectorization-factors.ll    243 lines

Diff 37093

include/llvm/Analysis/VectorUtils.h

	//===- llvm/Transforms/Utils/VectorUtils.h - Vector utilities -- C++ --=====//			//===- llvm/Transforms/Utils/VectorUtils.h - Vector utilities -- C++ --=====//
	//			//
	// The LLVM Compiler Infrastructure			// The LLVM Compiler Infrastructure
	//			//
	// This file is distributed under the University of Illinois Open Source			// This file is distributed under the University of Illinois Open Source
	// License. See LICENSE.TXT for details.			// License. See LICENSE.TXT for details.
	//			//
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
	//			//
	// This file defines some vectorizer utilities.			// This file defines some vectorizer utilities.
	//			//
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//

	#ifndef LLVM_TRANSFORMS_UTILS_VECTORUTILS_H			#ifndef LLVM_TRANSFORMS_UTILS_VECTORUTILS_H
	#define LLVM_TRANSFORMS_UTILS_VECTORUTILS_H			#define LLVM_TRANSFORMS_UTILS_VECTORUTILS_H

				#include "llvm/ADT/ArrayRef.h"
	#include "llvm/Analysis/TargetLibraryInfo.h"			#include "llvm/Analysis/TargetLibraryInfo.h"
	#include "llvm/IR/IntrinsicInst.h"			#include "llvm/IR/IntrinsicInst.h"
	#include "llvm/IR/Intrinsics.h"			#include "llvm/IR/Intrinsics.h"

	namespace llvm {			namespace llvm {

				struct DemandedBits;
	class GetElementPtrInst;			class GetElementPtrInst;
	class Loop;			class Loop;
	class ScalarEvolution;			class ScalarEvolution;
				class TargetTransformInfo;
	class Type;			class Type;
	class Value;			class Value;

	/// \brief Identify if the intrinsic is trivially vectorizable.			/// \brief Identify if the intrinsic is trivially vectorizable.
	/// This method returns true if the intrinsic's argument types are all			/// This method returns true if the intrinsic's argument types are all
	/// scalars for the scalar form of the intrinsic and all vectors for			/// scalars for the scalar form of the intrinsic and all vectors for
	/// the vector form of the intrinsic.			/// the vector form of the intrinsic.
	bool isTriviallyVectorizable(Intrinsic::ID ID);			bool isTriviallyVectorizable(Intrinsic::ID ID);
	[... 45 lines not shown ...]
	/// from the vector.			/// from the vector.
	Value *findScalarElement(Value *V, unsigned EltNo);			Value *findScalarElement(Value *V, unsigned EltNo);

	/// \brief Get splat value if the input is a splat vector or return nullptr.			/// \brief Get splat value if the input is a splat vector or return nullptr.
	/// The value may be extracted from a splat constants vector or from			/// The value may be extracted from a splat constants vector or from
	/// a sequence of instructions that broadcast a single value into a vector.			/// a sequence of instructions that broadcast a single value into a vector.
	Value *getSplatValue(Value *V);			Value *getSplatValue(Value *V);

				/// \brief Compute a map of integer instructions to their minimum legal type
				/// size.
				///
				/// C semantics force sub-int-sized values (e.g. i8, i16) to be promoted to int
				/// type (e.g. i32) whenever arithmetic is performed on them.
				///
				/// For targets with native i8 or i16 operations, usually InstCombine can shrink
				/// the arithmetic type down again. However InstCombine refuses to create
				/// illegal types, so for targets without i8 or i16 registers, the lengthening
				/// and shrinking remains.
				///
				/// Most SIMD ISAs (e.g. NEON) however support vectors of i8 or i16 even when
				/// their scalar equivalents do not, so during vectorization it is important to
				/// remove these lengthens and truncates when deciding the profitability of
				/// vectorization.
				///
				/// This function analyzes the given range of instructions and determines the
				/// minimum type size each can be converted to. It attempts to remove or
				/// minimize type size changes across each def-use chain, so for example in the
				/// following code:
				///
				/// %1 = load i8, i8*
				/// %2 = add i8 %1, 2
				/// %3 = load i16, i16*
				/// %4 = zext i8 %2 to i32
				/// %5 = zext i16 %3 to i32
				/// %6 = add i32 %4, %5
				/// %7 = trunc i32 %6 to i16
				///
				/// Instruction %6 must be done at least in i16, so computeMinimumValueSizes
				/// will return: {%1: 16, %2: 16, %3: 16, %4: 16, %5: 16, %6: 16, %7: 16}.
				///
				/// If the optional TargetTransformInfo is provided, this function tries harder
				/// to do less work by only looking at illegal types.
				DenseMap<Instruction*, uint64_t>
				computeMinimumValueSizes(ArrayRef<BasicBlock*> Blocks,
				DemandedBits &DB,
				const TargetTransformInfo *TTI=nullptr);

	} // llvm namespace			} // llvm namespace

	#endif			#endif

lib/Analysis/VectorUtils.cpp

//===----------- VectorUtils.cpp - Vectorizer utility functions -----------===//		//===----------- VectorUtils.cpp - Vectorizer utility functions -----------===//
//		//
// The LLVM Compiler Infrastructure		// The LLVM Compiler Infrastructure
//		//
// This file is distributed under the University of Illinois Open Source		// This file is distributed under the University of Illinois Open Source
// License. See LICENSE.TXT for details.		// License. See LICENSE.TXT for details.
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
//		//
// This file defines vectorizer utilities.		// This file defines vectorizer utilities.
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

		#include "llvm/ADT/EquivalenceClasses.h"
		#include "llvm/Analysis/DemandedBits.h"
#include "llvm/Analysis/LoopInfo.h"		#include "llvm/Analysis/LoopInfo.h"
#include "llvm/Analysis/ScalarEvolutionExpressions.h"		#include "llvm/Analysis/ScalarEvolutionExpressions.h"
#include "llvm/Analysis/ScalarEvolution.h"		#include "llvm/Analysis/ScalarEvolution.h"
		#include "llvm/Analysis/TargetTransformInfo.h"
#include "llvm/Analysis/VectorUtils.h"		#include "llvm/Analysis/VectorUtils.h"
#include "llvm/IR/GetElementPtrTypeIterator.h"		#include "llvm/IR/GetElementPtrTypeIterator.h"
#include "llvm/IR/PatternMatch.h"		#include "llvm/IR/PatternMatch.h"
#include "llvm/IR/Value.h"		#include "llvm/IR/Value.h"
#include "llvm/IR/Constants.h"		#include "llvm/IR/Constants.h"

using namespace llvm;		using namespace llvm;
using namespace llvm::PatternMatch;		using namespace llvm::PatternMatch;
[... 404 lines not shown ...]	llvm::Value *llvm::getSplatValue(Value *V) {
auto *InsertEltInst =		auto *InsertEltInst =
dyn_cast<InsertElementInst>(ShuffleInst->getOperand(0));		dyn_cast<InsertElementInst>(ShuffleInst->getOperand(0));
if (!InsertEltInst || !isa<ConstantInt>(InsertEltInst->getOperand(2)) ||		if (!InsertEltInst || !isa<ConstantInt>(InsertEltInst->getOperand(2)) ||
!cast<ConstantInt>(InsertEltInst->getOperand(2))->isNullValue())		!cast<ConstantInt>(InsertEltInst->getOperand(2))->isNullValue())
return nullptr;		return nullptr;

return InsertEltInst->getOperand(1);		return InsertEltInst->getOperand(1);
}		}

		DenseMap<Instruction*, uint64_t> llvm::computeMinimumValueSizes(
		ArrayRef<BasicBlock*> Blocks, DemandedBits &DB,
		const TargetTransformInfo *TTI) {

		// DemandedBits will give us every value's live-out bits. But we want
		// to ensure no extra casts would need to be inserted, so every DAG
		// of connected values must have the same minimum bitwidth.
		EquivalenceClasses<Value*> ECs;
		SmallVector<Value*,16> Worklist;
		SmallPtrSet<Value*,4> Roots;
		SmallPtrSet<Value*,16> Visited;
		DenseMap<Value*,uint64_t> DBits;
		SmallPtrSet<Instruction*,4> InstructionSet;
		DenseMap<Instruction*, uint64_t> MinBWs;

		// Determine the roots. We work bottom-up, from truncs or icmps.
		bool SeenExtFromIllegalType = false;
		for (auto *BB : Blocks)
		for (auto &I : *BB) {
		InstructionSet.insert(&I);

		if (TTI && (isa<ZExtInst>(&I) || isa<SExtInst>(&I)) &&
		!TTI->isTypeLegal(I.getOperand(0)->getType()))
		SeenExtFromIllegalType = true;

		// Only deal with non-vector integers up to 64-bits wide.
		if ((isa<TruncInst>(&I) || isa<ICmpInst>(&I)) &&
		!I.getType()->isVectorTy() &&
		I.getOperand(0)->getType()->getScalarSizeInBits() <= 64) {
		// Don't make work for ourselves. If we know the loaded type is legal,
		// don't add it to the worklist.
		if (TTI && isa<TruncInst>(&I) && TTI->isTypeLegal(I.getType()))
		continue;

		Worklist.push_back(&I);
		Roots.insert(&I);
		}
		}
		// Early exit.
		if (Worklist.empty() || (TTI && !SeenExtFromIllegalType))
		return MinBWs;

		// Now proceed breadth-first, unioning values together.
		while (!Worklist.empty()) {
		Value *Val = Worklist.pop_back_val();
		Value *Leader = ECs.getOrInsertLeaderValue(Val);

		if (Visited.count(Val))
		continue;
		Visited.insert(Val);

		// Non-instructions terminate a chain successfully.
		if (!isa<Instruction>(Val))
		continue;
		Instruction *I = cast<Instruction>(Val);

		// If we encounter a type that is larger than 64 bits, we can't represent
		// it so bail out.
		if (DB.getDemandedBits(I).getBitWidth() > 64)
		return DenseMap<Instruction*,uint64_t>();

		uint64_t V = DB.getDemandedBits(I).getZExtValue();
		DBits[Leader] |= V;

		// Casts, loads and instructions outside of our range terminate a chain
		// successfully.
		if (isa<SExtInst>(I) || isa<ZExtInst>(I) || isa<LoadInst>(I) ||
		!InstructionSet.count(I))
		continue;

		// Unsafe casts terminate a chain unsuccessfully. We can't do anything
		// useful with bitcasts, ptrtoints or inttoptrs and it'd be unsafe to
		// transform anything that relies on them.
		if (isa<BitCastInst>(I) || isa<PtrToIntInst>(I) || isa<IntToPtrInst>(I) ||
		!I->getType()->isIntegerTy()) {
		DBits[Leader] |= ~0ULL;
		continue;
		}

		// We don't modify the types of PHIs. Reductions will already have been
		// truncated if possible, and inductions' sizes will have been chosen by
		// indvars.
		if (isa<PHINode>(I))
		continue;

		if (DBits[Leader] == ~0ULL)
		// All bits demanded, no point continuing.
		continue;

		for (Value *O : cast<User>(I)->operands()) {
		ECs.unionSets(Leader, O);
		Worklist.push_back(O);
		}
		}

		// Now we've discovered all values, walk them to see if there are
		// any users we didn't see. If there are, we can't optimize that
		// chain.
		for (auto &I : DBits)
		for (auto *U : I.first->users())
		if (U->getType()->isIntegerTy() && DBits.count(U) == 0)
		DBits[ECs.getOrInsertLeaderValue(I.first)] |= ~0ULL;

		for (auto I = ECs.begin(), E = ECs.end(); I != E; ++I) {
		uint64_t LeaderDemandedBits = 0;
		for (auto MI = ECs.member_begin(I), ME = ECs.member_end(); MI != ME; ++MI)
		LeaderDemandedBits |= DBits[*MI];

		uint64_t MinBW = (sizeof(LeaderDemandedBits) * 8) -
		llvm::countLeadingZeros(LeaderDemandedBits);
		// Round up to a power of 2
		if (!isPowerOf2_64((uint64_t)MinBW))
		MinBW = NextPowerOf2(MinBW);
		for (auto MI = ECs.member_begin(I), ME = ECs.member_end(); MI != ME; ++MI) {
		if (!isa<Instruction>(*MI))
		continue;
		Type *Ty = (*MI)->getType();
		if (Roots.count(*MI))
		Ty = cast<Instruction>(*MI)->getOperand(0)->getType();
		if (MinBW < Ty->getScalarSizeInBits())
		MinBWs[cast<Instruction>(*MI)] = MinBW;
		}
		}

		return MinBWs;
		}

lib/Transforms/Vectorize/LoopVectorize.cpp

[... 42 lines not shown ...]
//		//
// S. Maleki, Y. Gao, M. Garzaran, T. Wong and D. Padua. An Evaluation of		// S. Maleki, Y. Gao, M. Garzaran, T. Wong and D. Padua. An Evaluation of
// Vectorizing Compilers.		// Vectorizing Compilers.
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

#include "llvm/Transforms/Vectorize.h"		#include "llvm/Transforms/Vectorize.h"
#include "llvm/ADT/DenseMap.h"		#include "llvm/ADT/DenseMap.h"
#include "llvm/ADT/EquivalenceClasses.h"
#include "llvm/ADT/Hashing.h"		#include "llvm/ADT/Hashing.h"
#include "llvm/ADT/MapVector.h"		#include "llvm/ADT/MapVector.h"
#include "llvm/ADT/SetVector.h"		#include "llvm/ADT/SetVector.h"
#include "llvm/ADT/SmallPtrSet.h"		#include "llvm/ADT/SmallPtrSet.h"
#include "llvm/ADT/SmallSet.h"		#include "llvm/ADT/SmallSet.h"
#include "llvm/ADT/SmallVector.h"		#include "llvm/ADT/SmallVector.h"
#include "llvm/ADT/Statistic.h"		#include "llvm/ADT/Statistic.h"
#include "llvm/ADT/StringExtras.h"		#include "llvm/ADT/StringExtras.h"
#include "llvm/Analysis/AliasAnalysis.h"		#include "llvm/Analysis/AliasAnalysis.h"
#include "llvm/Analysis/BasicAliasAnalysis.h"		#include "llvm/Analysis/BasicAliasAnalysis.h"
#include "llvm/Analysis/AliasSetTracker.h"		#include "llvm/Analysis/AliasSetTracker.h"
#include "llvm/Analysis/AssumptionCache.h"		#include "llvm/Analysis/AssumptionCache.h"
#include "llvm/Analysis/BlockFrequencyInfo.h"		#include "llvm/Analysis/BlockFrequencyInfo.h"
#include "llvm/Analysis/CodeMetrics.h"		#include "llvm/Analysis/CodeMetrics.h"
		#include "llvm/Analysis/DemandedBits.h"
#include "llvm/Analysis/GlobalsModRef.h"		#include "llvm/Analysis/GlobalsModRef.h"
#include "llvm/Analysis/LoopAccessAnalysis.h"		#include "llvm/Analysis/LoopAccessAnalysis.h"
#include "llvm/Analysis/LoopInfo.h"		#include "llvm/Analysis/LoopInfo.h"
#include "llvm/Analysis/LoopIterator.h"		#include "llvm/Analysis/LoopIterator.h"
#include "llvm/Analysis/LoopPass.h"		#include "llvm/Analysis/LoopPass.h"
#include "llvm/Analysis/ScalarEvolution.h"		#include "llvm/Analysis/ScalarEvolution.h"
#include "llvm/Analysis/ScalarEvolutionExpander.h"		#include "llvm/Analysis/ScalarEvolutionExpander.h"
#include "llvm/Analysis/ScalarEvolutionExpressions.h"		#include "llvm/Analysis/ScalarEvolutionExpressions.h"
[... 22 lines not shown ...]
#include "llvm/Support/Debug.h"		#include "llvm/Support/Debug.h"
#include "llvm/Support/raw_ostream.h"		#include "llvm/Support/raw_ostream.h"
#include "llvm/Transforms/Scalar.h"		#include "llvm/Transforms/Scalar.h"
#include "llvm/Transforms/Utils/BasicBlockUtils.h"		#include "llvm/Transforms/Utils/BasicBlockUtils.h"
#include "llvm/Transforms/Utils/Local.h"		#include "llvm/Transforms/Utils/Local.h"
#include "llvm/Analysis/VectorUtils.h"		#include "llvm/Analysis/VectorUtils.h"
#include "llvm/Transforms/Utils/LoopUtils.h"		#include "llvm/Transforms/Utils/LoopUtils.h"
#include <algorithm>		#include <algorithm>
		#include <functional>
#include <map>		#include <map>
#include <tuple>		#include <tuple>

using namespace llvm;		using namespace llvm;
using namespace llvm::PatternMatch;		using namespace llvm::PatternMatch;

#define LV_NAME "loop-vectorize"		#define LV_NAME "loop-vectorize"
#define DEBUG_TYPE LV_NAME		#define DEBUG_TYPE LV_NAME
[... 163 lines not shown ...]	InnerLoopVectorizer(Loop *OrigLoop, ScalarEvolution *SE, LoopInfo *LI,
unsigned UnrollFactor)		unsigned UnrollFactor)
: OrigLoop(OrigLoop), SE(SE), LI(LI), DT(DT), TLI(TLI), TTI(TTI),		: OrigLoop(OrigLoop), SE(SE), LI(LI), DT(DT), TLI(TLI), TTI(TTI),
VF(VecWidth), UF(UnrollFactor), Builder(SE->getContext()),		VF(VecWidth), UF(UnrollFactor), Builder(SE->getContext()),
Induction(nullptr), OldInduction(nullptr), WidenMap(UnrollFactor),		Induction(nullptr), OldInduction(nullptr), WidenMap(UnrollFactor),
TripCount(nullptr), VectorTripCount(nullptr), Legal(nullptr),		TripCount(nullptr), VectorTripCount(nullptr), Legal(nullptr),
AddedSafetyChecks(false) {}		AddedSafetyChecks(false) {}

// Perform the actual loop widening (vectorization).		// Perform the actual loop widening (vectorization).
void vectorize(LoopVectorizationLegality *L) {		// MinimumBitWidths maps scalar integer values to the smallest bitwidth they
		// can be validly truncated to. The cost model has assumed this truncation
		// will happen when vectorizing.
		void vectorize(LoopVectorizationLegality *L,
		DenseMap<Instruction*,uint64_t> MinimumBitWidths) {
		MinBWs = MinimumBitWidths;
Legal = L;		Legal = L;
// Create a new empty loop. Unlink the old loop and connect the new one.		// Create a new empty loop. Unlink the old loop and connect the new one.
createEmptyLoop();		createEmptyLoop();
// Widen each instruction in the old loop to a new one in the new loop.		// Widen each instruction in the old loop to a new one in the new loop.
// Use the Legality module to find the induction and reduction variables.		// Use the Legality module to find the induction and reduction variables.
vectorizeLoop();		vectorizeLoop();
}		}

[... 32 lines not shown ...]	protected:
virtual void vectorizeLoop();		virtual void vectorizeLoop();

/// \brief The Loop exit block may have single value PHI nodes where the		/// \brief The Loop exit block may have single value PHI nodes where the
/// incoming value is 'Undef'. While vectorizing we only handled real values		/// incoming value is 'Undef'. While vectorizing we only handled real values
/// that were defined inside the loop. Here we fix the 'undef case'.		/// that were defined inside the loop. Here we fix the 'undef case'.
/// See PR14725.		/// See PR14725.
void fixLCSSAPHIs();		void fixLCSSAPHIs();

		/// Shrinks vector element sizes based on information in "MinBWs".
		void truncateToMinimalBitwidths();

/// A helper function that computes the predicate of the block BB, assuming		/// A helper function that computes the predicate of the block BB, assuming
/// that the header block of the loop is set to True. It returns the entry		/// that the header block of the loop is set to True. It returns the entry
/// mask for the block BB.		/// mask for the block BB.
VectorParts createBlockInMask(BasicBlock *BB);		VectorParts createBlockInMask(BasicBlock *BB);
/// A helper function that computes the predicate of the edge between SRC		/// A helper function that computes the predicate of the edge between SRC
/// and DST.		/// and DST.
VectorParts createEdgeMask(BasicBlock *Src, BasicBlock *Dst);		VectorParts createEdgeMask(BasicBlock *Src, BasicBlock *Dst);

/// A helper function to vectorize a single BB within the innermost loop.		/// A helper function to vectorize a single BB within the innermost loop.
void vectorizeBlockInLoop(BasicBlock *BB, PhiVector *PV);		void vectorizeBlockInLoop(BasicBlock *BB, PhiVector *PV);

/// Vectorize a single PHINode in a block. This method handles the induction		/// Vectorize a single PHINode in a block. This method handles the induction
/// variable canonicalization. It supports both VF = 1 for unrolled loops and		/// variable canonicalization. It supports both VF = 1 for unrolled loops and
/// arbitrary length vectors.		/// arbitrary length vectors.
void widenPHIInstruction(Instruction *PN, VectorParts &Entry,		void widenPHIInstruction(Instruction *PN, VectorParts &Entry,
unsigned UF, unsigned VF, PhiVector *PV);		unsigned UF, unsigned VF, PhiVector *PV);

/// Insert the new loop to the loop hierarchy and pass manager		/// Insert the new loop to the loop hierarchy and pass manager
/// and update the analysis passes.		/// and update the analysis passes.
[... 143 lines not shown ...]	protected:
/// <StoreInst, Predicate>		/// <StoreInst, Predicate>
SmallVector<std::pair<StoreInst*,Value*>, 4> PredicatedStores;		SmallVector<std::pair<StoreInst*,Value*>, 4> PredicatedStores;
EdgeMaskCache MaskCache;		EdgeMaskCache MaskCache;
/// Trip count of the original loop.		/// Trip count of the original loop.
Value *TripCount;		Value *TripCount;
/// Trip count of the widened loop (TripCount - TripCount % (VF*UF))		/// Trip count of the widened loop (TripCount - TripCount % (VF*UF))
Value *VectorTripCount;		Value *VectorTripCount;

		/// Map of scalar integer values to the smallest bitwidth they can be legally
		/// represented as. The vector equivalents of these values should be truncated
		/// to this type.
		DenseMap<Instruction*,uint64_t> MinBWs;
LoopVectorizationLegality *Legal;		LoopVectorizationLegality *Legal;

// Record whether runtime check is added.		// Record whether runtime check is added.
bool AddedSafetyChecks;		bool AddedSafetyChecks;
};		};

class InnerLoopUnroller : public InnerLoopVectorizer {		class InnerLoopUnroller : public InnerLoopVectorizer {
public:		public:
[... 831 lines not shown ...]
/// expected speedup/slowdowns due to the supported instruction set. We use the		/// expected speedup/slowdowns due to the supported instruction set. We use the
/// TargetTransformInfo to query the different backends for the cost of		/// TargetTransformInfo to query the different backends for the cost of
/// different operations.		/// different operations.
class LoopVectorizationCostModel {		class LoopVectorizationCostModel {
public:		public:
LoopVectorizationCostModel(Loop *L, ScalarEvolution *SE, LoopInfo *LI,		LoopVectorizationCostModel(Loop *L, ScalarEvolution *SE, LoopInfo *LI,
LoopVectorizationLegality *Legal,		LoopVectorizationLegality *Legal,
const TargetTransformInfo &TTI,		const TargetTransformInfo &TTI,
const TargetLibraryInfo *TLI, AssumptionCache *AC,		const TargetLibraryInfo *TLI, DemandedBits *DB,
		AssumptionCache *AC,
const Function *F, const LoopVectorizeHints *Hints,		const Function *F, const LoopVectorizeHints *Hints,
SmallPtrSetImpl<const Value *> &ValuesToIgnore)		SmallPtrSetImpl<const Value *> &ValuesToIgnore)
: TheLoop(L), SE(SE), LI(LI), Legal(Legal), TTI(TTI), TLI(TLI),		: TheLoop(L), SE(SE), LI(LI), Legal(Legal), TTI(TTI), TLI(TLI), DB(DB),
TheFunction(F), Hints(Hints), ValuesToIgnore(ValuesToIgnore) {}		TheFunction(F), Hints(Hints), ValuesToIgnore(ValuesToIgnore) {}

/// Information about vectorization costs		/// Information about vectorization costs
struct VectorizationFactor {		struct VectorizationFactor {
unsigned Width; // Vector width with best cost		unsigned Width; // Vector width with best cost
unsigned Cost; // Cost of the loop with that width		unsigned Cost; // Cost of the loop with that width
};		};
/// \return The most profitable vectorization factor and the cost of that VF.		/// \return The most profitable vectorization factor and the cost of that VF.
[... 53 lines not shown ...]	private:
/// Report an analysis message to assist the user in diagnosing loops that are		/// Report an analysis message to assist the user in diagnosing loops that are
/// not vectorized. These are handled as LoopAccessReport rather than		/// not vectorized. These are handled as LoopAccessReport rather than
/// VectorizationReport because the << operator of VectorizationReport returns		/// VectorizationReport because the << operator of VectorizationReport returns
/// LoopAccessReport.		/// LoopAccessReport.
void emitAnalysis(const LoopAccessReport &Message) const {		void emitAnalysis(const LoopAccessReport &Message) const {
emitAnalysisDiag(TheFunction, TheLoop, *Hints, Message);		emitAnalysisDiag(TheFunction, TheLoop, *Hints, Message);
}		}

		public:
		/// Map of scalar integer values to the smallest bitwidth they can be legally
		/// represented as. The vector equivalents of these values should be truncated
		/// to this type.
		DenseMap<Instruction*,uint64_t> MinBWs;

/// The loop that we evaluate.		/// The loop that we evaluate.
Loop *TheLoop;		Loop *TheLoop;
/// Scev analysis.		/// Scev analysis.
ScalarEvolution *SE;		ScalarEvolution *SE;
/// Loop Info analysis.		/// Loop Info analysis.
LoopInfo *LI;		LoopInfo *LI;
/// Vectorization legality.		/// Vectorization legality.
LoopVectorizationLegality *Legal;		LoopVectorizationLegality *Legal;
/// Vector target information.		/// Vector target information.
const TargetTransformInfo &TTI;		const TargetTransformInfo &TTI;
/// Target Library Info.		/// Target Library Info.
const TargetLibraryInfo *TLI;		const TargetLibraryInfo *TLI;
		/// Demanded bits analysis
		DemandedBits *DB;
const Function *TheFunction;		const Function *TheFunction;
// Loop Vectorize Hint.		// Loop Vectorize Hint.
const LoopVectorizeHints *Hints;		const LoopVectorizeHints *Hints;
// Values to ignore in the cost model.		// Values to ignore in the cost model.
const SmallPtrSetImpl<const Value *> &ValuesToIgnore;		const SmallPtrSetImpl<const Value *> &ValuesToIgnore;
};		};

/// \brief This holds vectorization requirements that must be verified late in		/// \brief This holds vectorization requirements that must be verified late in
[... 76 lines not shown ...]	struct LoopVectorize : public FunctionPass {
}		}

ScalarEvolution *SE;		ScalarEvolution *SE;
LoopInfo *LI;		LoopInfo *LI;
TargetTransformInfo *TTI;		TargetTransformInfo *TTI;
DominatorTree *DT;		DominatorTree *DT;
BlockFrequencyInfo *BFI;		BlockFrequencyInfo *BFI;
TargetLibraryInfo *TLI;		TargetLibraryInfo *TLI;
		DemandedBits *DB;
AliasAnalysis *AA;		AliasAnalysis *AA;
AssumptionCache *AC;		AssumptionCache *AC;
LoopAccessAnalysis *LAA;		LoopAccessAnalysis *LAA;
bool DisableUnrolling;		bool DisableUnrolling;
bool AlwaysVectorize;		bool AlwaysVectorize;

BlockFrequency ColdEntryFreq;		BlockFrequency ColdEntryFreq;

bool runOnFunction(Function &F) override {		bool runOnFunction(Function &F) override {
SE = &getAnalysis<ScalarEvolutionWrapperPass>().getSE();		SE = &getAnalysis<ScalarEvolutionWrapperPass>().getSE();
LI = &getAnalysis<LoopInfoWrapperPass>().getLoopInfo();		LI = &getAnalysis<LoopInfoWrapperPass>().getLoopInfo();
TTI = &getAnalysis<TargetTransformInfoWrapperPass>().getTTI(F);		TTI = &getAnalysis<TargetTransformInfoWrapperPass>().getTTI(F);
DT = &getAnalysis<DominatorTreeWrapperPass>().getDomTree();		DT = &getAnalysis<DominatorTreeWrapperPass>().getDomTree();
BFI = &getAnalysis<BlockFrequencyInfoWrapperPass>().getBFI();		BFI = &getAnalysis<BlockFrequencyInfoWrapperPass>().getBFI();
auto *TLIP = getAnalysisIfAvailable<TargetLibraryInfoWrapperPass>();		auto *TLIP = getAnalysisIfAvailable<TargetLibraryInfoWrapperPass>();
TLI = TLIP ? &TLIP->getTLI() : nullptr;		TLI = TLIP ? &TLIP->getTLI() : nullptr;
AA = &getAnalysis<AAResultsWrapperPass>().getAAResults();		AA = &getAnalysis<AAResultsWrapperPass>().getAAResults();
AC = &getAnalysis<AssumptionCacheTracker>().getAssumptionCache(F);		AC = &getAnalysis<AssumptionCacheTracker>().getAssumptionCache(F);
LAA = &getAnalysis<LoopAccessAnalysis>();		LAA = &getAnalysis<LoopAccessAnalysis>();
		DB = &getAnalysis<DemandedBits>();

// Compute some weights outside of the loop over the loops. Compute this		// Compute some weights outside of the loop over the loops. Compute this
// using a BranchProbability to re-use its scaling math.		// using a BranchProbability to re-use its scaling math.
const BranchProbability ColdProb(1, 5); // 20%		const BranchProbability ColdProb(1, 5); // 20%
ColdEntryFreq = BlockFrequency(BFI->getEntryFreq()) * ColdProb;		ColdEntryFreq = BlockFrequency(BFI->getEntryFreq()) * ColdProb;

// Don't attempt if		// Don't attempt if
// 1. the target claims to have no vector registers, and		// 1. the target claims to have no vector registers, and
[129 lines not shown]	#endif /* NDEBUG */
CodeMetrics::collectEphemeralValues(L, AC, ValuesToIgnore);		CodeMetrics::collectEphemeralValues(L, AC, ValuesToIgnore);
for (auto &Reduction : *LVL.getReductionVars()) {		for (auto &Reduction : *LVL.getReductionVars()) {
RecurrenceDescriptor &RedDes = Reduction.second;		RecurrenceDescriptor &RedDes = Reduction.second;
SmallPtrSetImpl<Instruction *> &Casts = RedDes.getCastInsts();		SmallPtrSetImpl<Instruction *> &Casts = RedDes.getCastInsts();
ValuesToIgnore.insert(Casts.begin(), Casts.end());		ValuesToIgnore.insert(Casts.begin(), Casts.end());
}		}

// Use the cost model.		// Use the cost model.
LoopVectorizationCostModel CM(L, SE, LI, &LVL, *TTI, TLI, AC, F, &Hints,		LoopVectorizationCostModel CM(L, SE, LI, &LVL, *TTI, TLI, DB, AC, F, &Hints,
ValuesToIgnore);		ValuesToIgnore);

// Check the function attributes to find out if this function should be		// Check the function attributes to find out if this function should be
// optimized for size.		// optimized for size.
bool OptForSize = Hints.getForce() != LoopVectorizeHints::FK_Enabled &&		bool OptForSize = Hints.getForce() != LoopVectorizeHints::FK_Enabled &&
F->optForSize();		F->optForSize();

// Compute the weighted frequency of this loop being executed and see if it		// Compute the weighted frequency of this loop being executed and see if it
[96 lines not shown]	if (!VectorizeLoop && !InterleaveLoop) {
DEBUG(dbgs() << "LV: Interleave Count is " << IC << '\n');		DEBUG(dbgs() << "LV: Interleave Count is " << IC << '\n');
}		}

if (!VectorizeLoop) {		if (!VectorizeLoop) {
assert(IC > 1 && "interleave count should not be 1 or 0");		assert(IC > 1 && "interleave count should not be 1 or 0");
// If we decided that it is not legal to vectorize the loop then		// If we decided that it is not legal to vectorize the loop then
// interleave it.		// interleave it.
InnerLoopUnroller Unroller(L, SE, LI, DT, TLI, TTI, IC);		InnerLoopUnroller Unroller(L, SE, LI, DT, TLI, TTI, IC);
Unroller.vectorize(&LVL);		Unroller.vectorize(&LVL, CM.MinBWs);

emitOptimizationRemark(F->getContext(), LV_NAME, *F, L->getStartLoc(),		emitOptimizationRemark(F->getContext(), LV_NAME, *F, L->getStartLoc(),
Twine("interleaved loop (interleaved count: ") +		Twine("interleaved loop (interleaved count: ") +
Twine(IC) + ")");		Twine(IC) + ")");
} else {		} else {
// If we decided that it is legal to vectorize the loop then do it.		// If we decided that it is legal to vectorize the loop then do it.
InnerLoopVectorizer LB(L, SE, LI, DT, TLI, TTI, VF.Width, IC);		InnerLoopVectorizer LB(L, SE, LI, DT, TLI, TTI, VF.Width, IC);
LB.vectorize(&LVL);		LB.vectorize(&LVL, CM.MinBWs);
++LoopsVectorized;		++LoopsVectorized;

// Add metadata to disable runtime unrolling scalar loop when there's no		// Add metadata to disable runtime unrolling scalar loop when there's no
// runtime check about strides and memory. Because at this situation,		// runtime check about strides and memory. Because at this situation,
// scalar loop is rarely used not worthy to be unrolled.		// scalar loop is rarely used not worthy to be unrolled.
if (!LB.IsSafetyChecksAdded())		if (!LB.IsSafetyChecksAdded())
AddRuntimeUnrollDisableMetaData(L);		AddRuntimeUnrollDisableMetaData(L);

[17 lines not shown]	void getAnalysisUsage(AnalysisUsage &AU) const override {
AU.addRequiredID(LCSSAID);		AU.addRequiredID(LCSSAID);
AU.addRequired<BlockFrequencyInfoWrapperPass>();		AU.addRequired<BlockFrequencyInfoWrapperPass>();
AU.addRequired<DominatorTreeWrapperPass>();		AU.addRequired<DominatorTreeWrapperPass>();
AU.addRequired<LoopInfoWrapperPass>();		AU.addRequired<LoopInfoWrapperPass>();
AU.addRequired<ScalarEvolutionWrapperPass>();		AU.addRequired<ScalarEvolutionWrapperPass>();
AU.addRequired<TargetTransformInfoWrapperPass>();		AU.addRequired<TargetTransformInfoWrapperPass>();
AU.addRequired<AAResultsWrapperPass>();		AU.addRequired<AAResultsWrapperPass>();
AU.addRequired<LoopAccessAnalysis>();		AU.addRequired<LoopAccessAnalysis>();
		AU.addRequired<DemandedBits>();
AU.addPreserved<LoopInfoWrapperPass>();		AU.addPreserved<LoopInfoWrapperPass>();
AU.addPreserved<DominatorTreeWrapperPass>();		AU.addPreserved<DominatorTreeWrapperPass>();
AU.addPreserved<BasicAAWrapperPass>();		AU.addPreserved<BasicAAWrapperPass>();
AU.addPreserved<AAResultsWrapperPass>();		AU.addPreserved<AAResultsWrapperPass>();
AU.addPreserved<GlobalsAAWrapperPass>();		AU.addPreserved<GlobalsAAWrapperPass>();
}		}

};		};
[151 lines not shown]	InnerLoopVectorizer::getVectorValue(Value *V) {

// If we have this scalar in the map, return it.		// If we have this scalar in the map, return it.
if (WidenMap.has(V))		if (WidenMap.has(V))
return WidenMap.get(V);		return WidenMap.get(V);

// If this scalar is unknown, assume that it is a constant or that it is		// If this scalar is unknown, assume that it is a constant or that it is
// loop invariant. Broadcast V and save the value for future uses.		// loop invariant. Broadcast V and save the value for future uses.
Value *B = getBroadcastInstrs(V);		Value *B = getBroadcastInstrs(V);

return WidenMap.splat(V, B);		return WidenMap.splat(V, B);
}		}

Value *InnerLoopVectorizer::reverseVector(Value *Vec) {		Value *InnerLoopVectorizer::reverseVector(Value *Vec) {
assert(Vec->getType()->isVectorTy() && "Invalid type");		assert(Vec->getType()->isVectorTy() && "Invalid type");
SmallVector<Constant*, 8> ShuffleMask;		SmallVector<Constant*, 8> ShuffleMask;
for (unsigned i = 0; i < VF; ++i)		for (unsigned i = 0; i < VF; ++i)
ShuffleMask.push_back(Builder.getInt32(VF - i - 1));		ShuffleMask.push_back(Builder.getInt32(VF - i - 1));
[1,077 lines not shown]	static unsigned getVectorIntrinsicCost(CallInst *CI, unsigned VF,
Type *RetTy = ToVectorTy(CI->getType(), VF);		Type *RetTy = ToVectorTy(CI->getType(), VF);
SmallVector<Type *, 4> Tys;		SmallVector<Type *, 4> Tys;
for (unsigned i = 0, ie = CI->getNumArgOperands(); i != ie; ++i)		for (unsigned i = 0, ie = CI->getNumArgOperands(); i != ie; ++i)
Tys.push_back(ToVectorTy(CI->getArgOperand(i)->getType(), VF));		Tys.push_back(ToVectorTy(CI->getArgOperand(i)->getType(), VF));

return TTI.getIntrinsicInstrCost(ID, RetTy, Tys);		return TTI.getIntrinsicInstrCost(ID, RetTy, Tys);
}		}

		static Type *smallestIntegerVectorType(Type *T1, Type *T2) {
		IntegerType *I1 = cast<IntegerType>(T1->getVectorElementType());
		IntegerType *I2 = cast<IntegerType>(T2->getVectorElementType());
		return I1->getBitWidth() < I2->getBitWidth() ? T1 : T2;
		}
		static Type *largestIntegerVectorType(Type *T1, Type *T2) {
		IntegerType *I1 = cast<IntegerType>(T1->getVectorElementType());
		IntegerType *I2 = cast<IntegerType>(T2->getVectorElementType());
		return I1->getBitWidth() > I2->getBitWidth() ? T1 : T2;
		}

		void InnerLoopVectorizer::truncateToMinimalBitwidths() {
		// For every instruction `I` in MinBWs, truncate the operands, create a
		// truncated version of `I` and reextend its result. InstCombine runs
		// later and will remove any ext/trunc pairs.
		//
		[inline comment] sbaranga: Is there any way of doing this without leaving the ext/trunc pairs? Maybe instead of generating a trunc look if we're doing it for an ext and if so use the ext operand instead.
		[inline comment] jmolloy (author): Yeah, I suppose that would be neater than relying on InstCombine. I'll take a look at this.
		[inline comment] sbaranga: It looks like this solution will leave us some zext instructions with no users. I think we can clean these up as well.
		for (auto &KV : MinBWs) {
		VectorParts &Parts = WidenMap.get(KV.first);
		for (Value *&I : Parts) {
		if (I->use_empty())
		continue;
		Type *OriginalTy = I->getType();
		Type *ScalarTruncatedTy = IntegerType::get(OriginalTy->getContext(),
		KV.second);
		Type *TruncatedTy = VectorType::get(ScalarTruncatedTy,
		OriginalTy->getVectorNumElements());
		if (TruncatedTy == OriginalTy)
		continue;

		IRBuilder<> B(cast<Instruction>(I));
		auto ShrinkOperand = [&](Value *V) -> Value * {
		if (auto *ZI = dyn_cast<ZExtInst>(V))
		if (ZI->getSrcTy() == TruncatedTy)
		return ZI->getOperand(0);
		return B.CreateZExtOrTrunc(V, TruncatedTy);
		};

		// The actual instruction modification depends on the instruction type,
		// unfortunately.
		Value *NewI = nullptr;
		if (BinaryOperator *BO = dyn_cast<BinaryOperator>(I)) {
		NewI = B.CreateBinOp(BO->getOpcode(),
		ShrinkOperand(BO->getOperand(0)),
		ShrinkOperand(BO->getOperand(1)));
		cast<BinaryOperator>(NewI)->copyIRFlags(I);
		} else if (ICmpInst *CI = dyn_cast<ICmpInst>(I)) {
		NewI = B.CreateICmp(CI->getPredicate(),
		ShrinkOperand(CI->getOperand(0)),
		ShrinkOperand(CI->getOperand(1)));
		} else if (SelectInst *SI = dyn_cast<SelectInst>(I)) {
		NewI = B.CreateSelect(SI->getCondition(),
		ShrinkOperand(SI->getTrueValue()),
		ShrinkOperand(SI->getFalseValue()));
		} else if (CastInst *CI = dyn_cast<CastInst>(I)) {
		switch (CI->getOpcode()) {
		default: llvm_unreachable("Unhandled cast!");
		case Instruction::Trunc:
		NewI = ShrinkOperand(CI->getOperand(0));
		break;
		case Instruction::SExt:
		NewI = B.CreateSExtOrTrunc(CI->getOperand(0),
		smallestIntegerVectorType(OriginalTy,
		TruncatedTy));
		break;
		case Instruction::ZExt:
		NewI = B.CreateZExtOrTrunc(CI->getOperand(0),
		smallestIntegerVectorType(OriginalTy,
		TruncatedTy));
		break;
		}
		} else if (ShuffleVectorInst *SI = dyn_cast<ShuffleVectorInst>(I)) {
		auto Elements0 = SI->getOperand(0)->getType()->getVectorNumElements();
		auto *O0 =
		B.CreateZExtOrTrunc(SI->getOperand(0),
		VectorType::get(ScalarTruncatedTy, Elements0));
		auto Elements1 = SI->getOperand(1)->getType()->getVectorNumElements();
		auto *O1 =
		B.CreateZExtOrTrunc(SI->getOperand(1),
		VectorType::get(ScalarTruncatedTy, Elements1));

		NewI = B.CreateShuffleVector(O0, O1, SI->getMask());
		} else if (isa<LoadInst>(I)) {
		// Don't do anything with the operands, just extend the result.
		continue;
		} else {
		llvm_unreachable("Unhandled instruction type!");
		}

		// Lastly, extend the result.
		NewI->takeName(cast<Instruction>(I));
		Value *Res = B.CreateZExtOrTrunc(NewI, OriginalTy);
		I->replaceAllUsesWith(Res);
		cast<Instruction>(I)->eraseFromParent();
		I = Res;
		}
		}

		// We'll have created a bunch of ZExts that are now parentless. Clean up.
		for (auto &KV : MinBWs) {
		VectorParts &Parts = WidenMap.get(KV.first);
		for (Value *&I : Parts) {
		ZExtInst *Inst = dyn_cast<ZExtInst>(I);
		if (Inst && Inst->use_empty()) {
		Value *NewI = Inst->getOperand(0);
		Inst->eraseFromParent();
		I = NewI;
		}
		}
		}
		}

void InnerLoopVectorizer::vectorizeLoop() {		void InnerLoopVectorizer::vectorizeLoop() {
//===------------------------------------------------===//		//===------------------------------------------------===//
//		//
// Notice: any optimization or new instruction that go		// Notice: any optimization or new instruction that go
// into the code below should be also be implemented in		// into the code below should be also be implemented in
// the cost-model.		// the cost-model.
//		//
//===------------------------------------------------===//		//===------------------------------------------------===//
[14 lines not shown]	void InnerLoopVectorizer::vectorizeLoop() {
LoopBlocksDFS DFS(OrigLoop);		LoopBlocksDFS DFS(OrigLoop);
DFS.perform(LI);		DFS.perform(LI);

// Vectorize all of the blocks in the original loop.		// Vectorize all of the blocks in the original loop.
for (LoopBlocksDFS::RPOIterator bb = DFS.beginRPO(),		for (LoopBlocksDFS::RPOIterator bb = DFS.beginRPO(),
be = DFS.endRPO(); bb != be; ++bb)		be = DFS.endRPO(); bb != be; ++bb)
vectorizeBlockInLoop(*bb, &RdxPHIsToFix);		vectorizeBlockInLoop(*bb, &RdxPHIsToFix);

		// Insert truncates and extends for any truncated instructions as hints to
		// InstCombine.
		if (VF > 1)
		truncateToMinimalBitwidths();

// At this point every instruction in the original loop is widened to		// At this point every instruction in the original loop is widened to
// a vector form. We are almost done. Now, we need to fix the PHI nodes		// a vector form. We are almost done. Now, we need to fix the PHI nodes
// that we vectorized. The PHI nodes are currently empty because we did		// that we vectorized. The PHI nodes are currently empty because we did
// not want to introduce cycles. Notice that the remaining PHI nodes		// not want to introduce cycles. Notice that the remaining PHI nodes
// that we need to fix are reduction variables.		// that we need to fix are reduction variables.

// Create the 'reduced' values for each of the induction vars.		// Create the 'reduced' values for each of the induction vars.
// The reduced values are the vector values that we scalarize and combine		// The reduced values are the vector values that we scalarize and combine
[417 lines not shown]	case InductionDescriptor::IK_PtrInduction:
return;		return;
}		}
}		}

void InnerLoopVectorizer::vectorizeBlockInLoop(BasicBlock *BB, PhiVector *PV) {		void InnerLoopVectorizer::vectorizeBlockInLoop(BasicBlock *BB, PhiVector *PV) {
// For each instruction in the old loop.		// For each instruction in the old loop.
for (BasicBlock::iterator it = BB->begin(), e = BB->end(); it != e; ++it) {		for (BasicBlock::iterator it = BB->begin(), e = BB->end(); it != e; ++it) {
VectorParts &Entry = WidenMap.get(it);		VectorParts &Entry = WidenMap.get(it);

switch (it->getOpcode()) {		switch (it->getOpcode()) {
case Instruction::Br:		case Instruction::Br:
// Nothing to do for PHIs and BR, since we already took care of the		// Nothing to do for PHIs and BR, since we already took care of the
// loop control flow instructions.		// loop control flow instructions.
continue;		continue;
case Instruction::PHI: {		case Instruction::PHI: {
// Vectorize PHINodes.		// Vectorize PHINodes.
widenPHIInstruction(it, Entry, UF, VF, PV);		widenPHIInstruction(it, Entry, UF, VF, PV);
[47 lines not shown]	case Instruction::Select: {

// The condition can be loop invariant but still defined inside the		// The condition can be loop invariant but still defined inside the
// loop. This means that we can't just use the original 'cond' value.		// loop. This means that we can't just use the original 'cond' value.
// We have to take the 'vectorized' value and pick the first lane.		// We have to take the 'vectorized' value and pick the first lane.
// Instcombine will make this a no-op.		// Instcombine will make this a no-op.
VectorParts &Cond = getVectorValue(it->getOperand(0));		VectorParts &Cond = getVectorValue(it->getOperand(0));
VectorParts &Op0 = getVectorValue(it->getOperand(1));		VectorParts &Op0 = getVectorValue(it->getOperand(1));
VectorParts &Op1 = getVectorValue(it->getOperand(2));		VectorParts &Op1 = getVectorValue(it->getOperand(2));

Value *ScalarCond = (VF == 1) ? Cond[0] :		Value *ScalarCond = (VF == 1) ? Cond[0] :
Builder.CreateExtractElement(Cond[0], Builder.getInt32(0));		Builder.CreateExtractElement(Cond[0], Builder.getInt32(0));

for (unsigned Part = 0; Part < UF; ++Part) {		for (unsigned Part = 0; Part < UF; ++Part) {
Entry[Part] = Builder.CreateSelect(		Entry[Part] = Builder.CreateSelect(
InvariantCond ? ScalarCond : Cond[Part],		InvariantCond ? ScalarCond : Cond[Part],
Op0[Part],		Op0[Part],
Op1[Part]);		Op1[Part]);
}		}

propagateMetadata(Entry, it);		propagateMetadata(Entry, it);
break;		break;
}		}

case Instruction::ICmp:		case Instruction::ICmp:
case Instruction::FCmp: {		case Instruction::FCmp: {
// Widen compares. Generate vector compares.		// Widen compares. Generate vector compares.
bool FCmp = (it->getOpcode() == Instruction::FCmp);		bool FCmp = (it->getOpcode() == Instruction::FCmp);
CmpInst *Cmp = dyn_cast<CmpInst>(it);		CmpInst *Cmp = dyn_cast<CmpInst>(it);
setDebugLocFromInst(Builder, it);		setDebugLocFromInst(Builder, it);
VectorParts &A = getVectorValue(it->getOperand(0));		VectorParts &A = getVectorValue(it->getOperand(0));
VectorParts &B = getVectorValue(it->getOperand(1));		VectorParts &B = getVectorValue(it->getOperand(1));
for (unsigned Part = 0; Part < UF; ++Part) {		for (unsigned Part = 0; Part < UF; ++Part) {
		[inline comment] sbaranga: Stray whitespace change?
		[inline comment] jmolloy (author): Woops, will fix.
Value *C = nullptr;		Value *C = nullptr;
if (FCmp) {		if (FCmp) {
C = Builder.CreateFCmp(Cmp->getPredicate(), A[Part], B[Part]);		C = Builder.CreateFCmp(Cmp->getPredicate(), A[Part], B[Part]);
cast<FCmpInst>(C)->copyFastMathFlags(it);		cast<FCmpInst>(C)->copyFastMathFlags(it);
} else {		} else {
C = Builder.CreateICmp(Cmp->getPredicate(), A[Part], B[Part]);		C = Builder.CreateICmp(Cmp->getPredicate(), A[Part], B[Part]);
}		}
Entry[Part] = C;		Entry[Part] = C;
[895 lines not shown]	if (!EnableCondStoresVectorization && Legal->getNumPredStores()) {
DEBUG(dbgs() << "LV: No vectorization. There are conditional stores.\n");		DEBUG(dbgs() << "LV: No vectorization. There are conditional stores.\n");
return Factor;		return Factor;
}		}

// Find the trip count.		// Find the trip count.
unsigned TC = SE->getSmallConstantTripCount(TheLoop);		unsigned TC = SE->getSmallConstantTripCount(TheLoop);
DEBUG(dbgs() << "LV: Found trip count: " << TC << '\n');		DEBUG(dbgs() << "LV: Found trip count: " << TC << '\n');

		MinBWs = computeMinimumValueSizes(TheLoop->getBlocks(), *DB, &TTI);
unsigned WidestType = getWidestType();		unsigned WidestType = getWidestType();
unsigned WidestRegister = TTI.getRegisterBitWidth(true);		unsigned WidestRegister = TTI.getRegisterBitWidth(true);
unsigned MaxSafeDepDist = -1U;		unsigned MaxSafeDepDist = -1U;
if (Legal->getMaxSafeDepDistBytes() != -1U)		if (Legal->getMaxSafeDepDistBytes() != -1U)
MaxSafeDepDist = Legal->getMaxSafeDepDistBytes() * 8;		MaxSafeDepDist = Legal->getMaxSafeDepDistBytes() * 8;
WidestRegister = ((WidestRegister < MaxSafeDepDist) ?		WidestRegister = ((WidestRegister < MaxSafeDepDist) ?
WidestRegister : MaxSafeDepDist);		WidestRegister : MaxSafeDepDist);
unsigned MaxVectorSize = WidestRegister / WidestType;		unsigned MaxVectorSize = WidestRegister / WidestType;
[507 lines not shown]
unsigned		unsigned
LoopVectorizationCostModel::getInstructionCost(Instruction *I, unsigned VF) {		LoopVectorizationCostModel::getInstructionCost(Instruction *I, unsigned VF) {
// If we know that this instruction will remain uniform, check the cost of		// If we know that this instruction will remain uniform, check the cost of
// the scalar version.		// the scalar version.
if (Legal->isUniformAfterVectorization(I))		if (Legal->isUniformAfterVectorization(I))
VF = 1;		VF = 1;

Type *RetTy = I->getType();		Type *RetTy = I->getType();
		if (VF > 1 && MinBWs.count(I))
		RetTy = IntegerType::get(RetTy->getContext(), MinBWs[I]);
Type *VectorTy = ToVectorTy(RetTy, VF);		Type *VectorTy = ToVectorTy(RetTy, VF);

// TODO: We need to estimate the cost of intrinsic calls.		// TODO: We need to estimate the cost of intrinsic calls.
switch (I->getOpcode()) {		switch (I->getOpcode()) {
case Instruction::GetElementPtr:		case Instruction::GetElementPtr:
// We mark this instruction as zero-cost because the cost of GEPs in		// We mark this instruction as zero-cost because the cost of GEPs in
// vectorized code depends on whether the corresponding memory instruction		// vectorized code depends on whether the corresponding memory instruction
// is scalarized or not. Therefore, we handle GEPs with the memory		// is scalarized or not. Therefore, we handle GEPs with the memory
[65 lines not shown]	case Instruction::Select: {
Type *CondTy = SI->getCondition()->getType();		Type *CondTy = SI->getCondition()->getType();
if (!ScalarCond)		if (!ScalarCond)
CondTy = VectorType::get(CondTy, VF);		CondTy = VectorType::get(CondTy, VF);

return TTI.getCmpSelInstrCost(I->getOpcode(), VectorTy, CondTy);		return TTI.getCmpSelInstrCost(I->getOpcode(), VectorTy, CondTy);
}		}
case Instruction::ICmp:		case Instruction::ICmp:
case Instruction::FCmp: {		case Instruction::FCmp: {
Type *ValTy = I->getOperand(0)->getType();		Type *ValTy = I->getOperand(0)->getType();
		if (VF > 1 && MinBWs.count(dyn_cast<Instruction>(I->getOperand(0))))
		ValTy = IntegerType::get(ValTy->getContext(), MinBWs[I]);
		[inline comment] sbaranga: I think you wouldn't get the correct cost here (for ICmp)?
		[inline comment] jmolloy (author): Ouch, you're right. I'll fix that.
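One plausible reading of the concern above is that the membership test and the lookup use different keys: `MinBWs.count(...)` is asked about operand 0 of the compare, while `MinBWs[I]` indexes by the compare itself. The standalone sketch below models a consistent lookup with a `std::map`. The names `MinBWMap` and `compareCostWidth` are hypothetical and are not part of the patch, and this is not necessarily the fix that was eventually committed.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for the MinBWs map: value name -> minimal bit width.
using MinBWMap = std::map<std::string, unsigned>;

// Returns the bit width at which a compare should be costed. The width is
// looked up under the same key that is tested for membership (the compare's
// first operand). If there is no entry, or VF == 1, the operand's original
// width is used unchanged.
unsigned compareCostWidth(const MinBWMap &MinBWs, const std::string &Op0Name,
                          unsigned Op0OriginalWidth, unsigned VF) {
  if (VF > 1) {
    auto It = MinBWs.find(Op0Name);
    if (It != MinBWs.end())
      return It->second;
  }
  return Op0OriginalWidth;
}
```

Using `find` once also avoids the side effect of `operator[]`, which default-inserts an entry when the key is absent.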
VectorTy = ToVectorTy(ValTy, VF);		VectorTy = ToVectorTy(ValTy, VF);
return TTI.getCmpSelInstrCost(I->getOpcode(), VectorTy);		return TTI.getCmpSelInstrCost(I->getOpcode(), VectorTy);
}		}
case Instruction::Store:		case Instruction::Store:
case Instruction::Load: {		case Instruction::Load: {
StoreInst *SI = dyn_cast<StoreInst>(I);		StoreInst *SI = dyn_cast<StoreInst>(I);
LoadInst *LI = dyn_cast<LoadInst>(I);		LoadInst *LI = dyn_cast<LoadInst>(I);
Type *ValTy = (SI ? SI->getValueOperand()->getType() :		Type *ValTy = (SI ? SI->getValueOperand()->getType() :
[107 lines not shown]	LoopVectorizationCostModel::getInstructionCost(Instruction *I, unsigned VF) {
case Instruction::FPTrunc:		case Instruction::FPTrunc:
case Instruction::BitCast: {		case Instruction::BitCast: {
// We optimize the truncation of induction variable.		// We optimize the truncation of induction variable.
// The cost of these is the same as the scalar operation.		// The cost of these is the same as the scalar operation.
if (I->getOpcode() == Instruction::Trunc &&		if (I->getOpcode() == Instruction::Trunc &&
Legal->isInductionVariable(I->getOperand(0)))		Legal->isInductionVariable(I->getOperand(0)))
return TTI.getCastInstrCost(I->getOpcode(), I->getType(),		return TTI.getCastInstrCost(I->getOpcode(), I->getType(),
I->getOperand(0)->getType());		I->getOperand(0)->getType());

Type *SrcVecTy = ToVectorTy(I->getOperand(0)->getType(), VF);		Type *SrcScalarTy = I->getOperand(0)->getType();
		Type *SrcVecTy = ToVectorTy(SrcScalarTy, VF);
		if (VF > 1 && MinBWs.count(I)) {
		// This cast is going to be shrunk. This may remove the cast or it might
		// turn it into slightly different cast. For example, if MinBW == 16,
		// "zext i8 %1 to i32" becomes "zext i8 %1 to i16".
		//
		// Calculate the modified src and dest types.
		Type *MinVecTy = VectorTy;
		if (I->getOpcode() == Instruction::Trunc) {
		SrcVecTy = smallestIntegerVectorType(SrcVecTy, MinVecTy);
		VectorTy = largestIntegerVectorType(ToVectorTy(I->getType(), VF),
		MinVecTy);
		} else if (I->getOpcode() == Instruction::ZExt ||
		I->getOpcode() == Instruction::SExt) {
		SrcVecTy = largestIntegerVectorType(SrcVecTy, MinVecTy);
		VectorTy = smallestIntegerVectorType(ToVectorTy(I->getType(), VF),
		MinVecTy);
		}
		}

return TTI.getCastInstrCost(I->getOpcode(), VectorTy, SrcVecTy);		return TTI.getCastInstrCost(I->getOpcode(), VectorTy, SrcVecTy);
}		}
case Instruction::Call: {		case Instruction::Call: {
bool NeedToScalarize;		bool NeedToScalarize;
CallInst *CI = cast<CallInst>(I);		CallInst *CI = cast<CallInst>(I);
unsigned CallCost = getVectorCallCost(CI, VF, TTI, TLI, NeedToScalarize);		unsigned CallCost = getVectorCallCost(CI, VF, TTI, TLI, NeedToScalarize);
if (getIntrinsicIDForCall(CI, TLI))		if (getIntrinsicIDForCall(CI, TLI))
return std::min(CallCost, getVectorIntrinsicCost(CI, VF, TTI, TLI));		return std::min(CallCost, getVectorIntrinsicCost(CI, VF, TTI, TLI));
[34 lines not shown]
INITIALIZE_PASS_DEPENDENCY(AssumptionCacheTracker)		INITIALIZE_PASS_DEPENDENCY(AssumptionCacheTracker)
INITIALIZE_PASS_DEPENDENCY(BlockFrequencyInfoWrapperPass)		INITIALIZE_PASS_DEPENDENCY(BlockFrequencyInfoWrapperPass)
INITIALIZE_PASS_DEPENDENCY(DominatorTreeWrapperPass)		INITIALIZE_PASS_DEPENDENCY(DominatorTreeWrapperPass)
INITIALIZE_PASS_DEPENDENCY(ScalarEvolutionWrapperPass)		INITIALIZE_PASS_DEPENDENCY(ScalarEvolutionWrapperPass)
INITIALIZE_PASS_DEPENDENCY(LCSSA)		INITIALIZE_PASS_DEPENDENCY(LCSSA)
INITIALIZE_PASS_DEPENDENCY(LoopInfoWrapperPass)		INITIALIZE_PASS_DEPENDENCY(LoopInfoWrapperPass)
INITIALIZE_PASS_DEPENDENCY(LoopSimplify)		INITIALIZE_PASS_DEPENDENCY(LoopSimplify)
INITIALIZE_PASS_DEPENDENCY(LoopAccessAnalysis)		INITIALIZE_PASS_DEPENDENCY(LoopAccessAnalysis)
		INITIALIZE_PASS_DEPENDENCY(DemandedBits)
INITIALIZE_PASS_END(LoopVectorize, LV_NAME, lv_name, false, false)		INITIALIZE_PASS_END(LoopVectorize, LV_NAME, lv_name, false, false)

namespace llvm {		namespace llvm {
Pass *createLoopVectorizePass(bool NoUnrolling, bool AlwaysVectorize) {		Pass *createLoopVectorizePass(bool NoUnrolling, bool AlwaysVectorize) {
return new LoopVectorize(NoUnrolling, AlwaysVectorize);		return new LoopVectorize(NoUnrolling, AlwaysVectorize);
}		}
}		}

[127 lines not shown]

test/Transforms/LoopVectorize/AArch64/loop-vectorization-factors.ll

This file was added.
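
As a reading aid for the tests in this file, the width rule from the summary (bit width minus the number of leading zeroes in the ORed demanded-bits mask) can be sketched in standalone C++. The helper names `countLeadingZeros32` and `minimumBitWidth` are hypothetical, the masks are supplied by hand rather than computed by DemandedBits, and any rounding or legality checks the real pass may apply are left out.

```cpp
#include <cassert>
#include <cstdint>
#include <initializer_list>

// Count the leading zero bits of a 32-bit mask (portable loop, no intrinsics).
// Returns 32 for a mask of zero.
unsigned countLeadingZeros32(uint32_t Mask) {
  unsigned N = 0;
  for (uint32_t Bit = 0x80000000u; Bit != 0 && (Mask & Bit) == 0; Bit >>= 1)
    ++N;
  return N;
}

// OR together the demanded-bits masks of every node in one use-def graph and
// return the smallest width that still holds all demanded bits.
unsigned minimumBitWidth(std::initializer_list<uint32_t> DemandedMasks) {
  uint32_t Overall = 0;
  for (uint32_t M : DemandedMasks)
    Overall |= M;
  return 32 - countLeadingZeros32(Overall);
}
```

For `@add_a` below, the only consumer of `%add` is a `trunc` to `i8`, so under this simplified model every node demands at most `0xFF` and `minimumBitWidth({0xFF, 0xFF})` gives 8, which is consistent with the `<16 x i8>` CHECK lines.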

				; RUN: opt -S < %s -basicaa -loop-vectorize -simplifycfg -instsimplify -instcombine -licm -force-vector-interleave=1 2>&1 | FileCheck %s

				target datalayout = "e-m:e-i64:64-i128:128-n32:64-S128"
				target triple = "aarch64"

				; CHECK-LABEL: @add_a(
				; CHECK: load <16 x i8>, <16 x i8>*
				; CHECK: add nuw nsw <16 x i8>
				; CHECK: store <16 x i8>
				; Function Attrs: nounwind
				define void @add_a(i8* noalias nocapture readonly %p, i8* noalias nocapture %q, i32 %len) #0 {
				entry:
				%cmp8 = icmp sgt i32 %len, 0
				br i1 %cmp8, label %for.body, label %for.cond.cleanup

				for.cond.cleanup: ; preds = %for.body, %entry
				ret void

				for.body: ; preds = %entry, %for.body
				%indvars.iv = phi i64 [ %indvars.iv.next, %for.body ], [ 0, %entry ]
				%arrayidx = getelementptr inbounds i8, i8* %p, i64 %indvars.iv
				%0 = load i8, i8* %arrayidx
				%conv = zext i8 %0 to i32
				%add = add nuw nsw i32 %conv, 2
				%conv1 = trunc i32 %add to i8
				%arrayidx3 = getelementptr inbounds i8, i8* %q, i64 %indvars.iv
				store i8 %conv1, i8* %arrayidx3
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%lftr.wideiv = trunc i64 %indvars.iv.next to i32
				%exitcond = icmp eq i32 %lftr.wideiv, %len
				br i1 %exitcond, label %for.cond.cleanup, label %for.body
				}

				; CHECK-LABEL: @add_b(
				; CHECK: load <8 x i16>, <8 x i16>*
				; CHECK: add nuw nsw <8 x i16>
				; CHECK: store <8 x i16>
				; Function Attrs: nounwind
				define void @add_b(i16* noalias nocapture readonly %p, i16* noalias nocapture %q, i32 %len) #0 {
				entry:
				%cmp9 = icmp sgt i32 %len, 0
				br i1 %cmp9, label %for.body, label %for.cond.cleanup

				for.cond.cleanup: ; preds = %for.body, %entry
				ret void

				for.body: ; preds = %entry, %for.body
				%indvars.iv = phi i64 [ %indvars.iv.next, %for.body ], [ 0, %entry ]
				%arrayidx = getelementptr inbounds i16, i16* %p, i64 %indvars.iv
				%0 = load i16, i16* %arrayidx
				%conv8 = zext i16 %0 to i32
				%add = add nuw nsw i32 %conv8, 2
				%conv1 = trunc i32 %add to i16
				%arrayidx3 = getelementptr inbounds i16, i16* %q, i64 %indvars.iv
				store i16 %conv1, i16* %arrayidx3
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%lftr.wideiv = trunc i64 %indvars.iv.next to i32
				%exitcond = icmp eq i32 %lftr.wideiv, %len
				br i1 %exitcond, label %for.cond.cleanup, label %for.body
				}

				; CHECK-LABEL: @add_c(
				; CHECK: load <8 x i8>, <8 x i8>*
				; CHECK: add nuw nsw <8 x i16>
				; CHECK: store <8 x i16>
				; Function Attrs: nounwind
				define void @add_c(i8* noalias nocapture readonly %p, i16* noalias nocapture %q, i32 %len) #0 {
				entry:
				%cmp8 = icmp sgt i32 %len, 0
				br i1 %cmp8, label %for.body, label %for.cond.cleanup

				for.cond.cleanup: ; preds = %for.body, %entry
				ret void

				for.body: ; preds = %entry, %for.body
				%indvars.iv = phi i64 [ %indvars.iv.next, %for.body ], [ 0, %entry ]
				%arrayidx = getelementptr inbounds i8, i8* %p, i64 %indvars.iv
				%0 = load i8, i8* %arrayidx
				%conv = zext i8 %0 to i32
				%add = add nuw nsw i32 %conv, 2
				%conv1 = trunc i32 %add to i16
				%arrayidx3 = getelementptr inbounds i16, i16* %q, i64 %indvars.iv
				store i16 %conv1, i16* %arrayidx3
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%lftr.wideiv = trunc i64 %indvars.iv.next to i32
				%exitcond = icmp eq i32 %lftr.wideiv, %len
				br i1 %exitcond, label %for.cond.cleanup, label %for.body
				}

				; CHECK-LABEL: @add_d(
				; CHECK: load <4 x i16>
				; CHECK: add nsw <4 x i32>
				; CHECK: store <4 x i32>
				define void @add_d(i16* noalias nocapture readonly %p, i32* noalias nocapture %q, i32 %len) #0 {
				entry:
				%cmp7 = icmp sgt i32 %len, 0
				br i1 %cmp7, label %for.body, label %for.cond.cleanup

				for.cond.cleanup: ; preds = %for.body, %entry
				ret void

				for.body: ; preds = %entry, %for.body
				%indvars.iv = phi i64 [ %indvars.iv.next, %for.body ], [ 0, %entry ]
				%arrayidx = getelementptr inbounds i16, i16* %p, i64 %indvars.iv
				%0 = load i16, i16* %arrayidx
				%conv = sext i16 %0 to i32
				%add = add nsw i32 %conv, 2
				%arrayidx2 = getelementptr inbounds i32, i32* %q, i64 %indvars.iv
				store i32 %add, i32* %arrayidx2
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%lftr.wideiv = trunc i64 %indvars.iv.next to i32
				%exitcond = icmp eq i32 %lftr.wideiv, %len
				br i1 %exitcond, label %for.cond.cleanup, label %for.body
				}

				; CHECK-LABEL: @add_e(
				; CHECK: load <16 x i8>
				; CHECK: shl <16 x i8>
				; CHECK: add nuw nsw <16 x i8>
				; CHECK: or <16 x i8>
				; CHECK: mul nuw nsw <16 x i8>
				; CHECK: and <16 x i8>
				; CHECK: xor <16 x i8>
				; CHECK: mul nuw nsw <16 x i8>
				; CHECK: store <16 x i8>
				define void @add_e(i8* noalias nocapture readonly %p, i8* noalias nocapture %q, i8 %arg1, i8 %arg2, i32 %len) #0 {
				entry:
				%cmp.32 = icmp sgt i32 %len, 0
				br i1 %cmp.32, label %for.body.lr.ph, label %for.cond.cleanup

				for.body.lr.ph: ; preds = %entry
				%conv11 = zext i8 %arg2 to i32
				%conv13 = zext i8 %arg1 to i32
				br label %for.body

				for.cond.cleanup: ; preds = %for.body, %entry
				ret void

				for.body: ; preds = %for.body, %for.body.lr.ph
				%indvars.iv = phi i64 [ 0, %for.body.lr.ph ], [ %indvars.iv.next, %for.body ]
				%arrayidx = getelementptr inbounds i8, i8* %p, i64 %indvars.iv
				%0 = load i8, i8* %arrayidx
				%conv = zext i8 %0 to i32
				%add = shl i32 %conv, 4
				%conv2 = add nuw nsw i32 %add, 32
				%or = or i32 %conv, 51
				%mul = mul nuw nsw i32 %or, 60
				%and = and i32 %conv2, %conv13
				%mul.masked = and i32 %mul, 252
				%conv17 = xor i32 %mul.masked, %conv11
				%mul18 = mul nuw nsw i32 %conv17, %and
				%conv19 = trunc i32 %mul18 to i8
				%arrayidx21 = getelementptr inbounds i8, i8* %q, i64 %indvars.iv
				store i8 %conv19, i8* %arrayidx21
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%lftr.wideiv = trunc i64 %indvars.iv.next to i32
				%exitcond = icmp eq i32 %lftr.wideiv, %len
				br i1 %exitcond, label %for.cond.cleanup, label %for.body
				}

				; CHECK-LABEL: @add_f
				; CHECK: load <8 x i16>
				; CHECK: trunc <8 x i16>
				; CHECK: shl <8 x i8>
				; CHECK: add nsw <8 x i8>
				; CHECK: or <8 x i8>
				; CHECK: mul nuw nsw <8 x i8>
				; CHECK: and <8 x i8>
				; CHECK: xor <8 x i8>
				; CHECK: mul nuw nsw <8 x i8>
				; CHECK: store <8 x i8>
				define void @add_f(i16* noalias nocapture readonly %p, i8* noalias nocapture %q, i8 %arg1, i8 %arg2, i32 %len) #0 {
				entry:
				%cmp.32 = icmp sgt i32 %len, 0
				br i1 %cmp.32, label %for.body.lr.ph, label %for.cond.cleanup

				for.body.lr.ph: ; preds = %entry
				%conv11 = zext i8 %arg2 to i32
				%conv13 = zext i8 %arg1 to i32
				br label %for.body

				for.cond.cleanup: ; preds = %for.body, %entry
				ret void

				for.body: ; preds = %for.body, %for.body.lr.ph
				%indvars.iv = phi i64 [ 0, %for.body.lr.ph ], [ %indvars.iv.next, %for.body ]
				%arrayidx = getelementptr inbounds i16, i16* %p, i64 %indvars.iv
				%0 = load i16, i16* %arrayidx
				%conv = sext i16 %0 to i32
				%add = shl i32 %conv, 4
				%conv2 = add nsw i32 %add, 32
				%or = and i32 %conv, 204
				%conv8 = or i32 %or, 51
				%mul = mul nuw nsw i32 %conv8, 60
				%and = and i32 %conv2, %conv13
				%mul.masked = and i32 %mul, 252
				%conv17 = xor i32 %mul.masked, %conv11
				%mul18 = mul nuw nsw i32 %conv17, %and
				%conv19 = trunc i32 %mul18 to i8
				%arrayidx21 = getelementptr inbounds i8, i8* %q, i64 %indvars.iv
				store i8 %conv19, i8* %arrayidx21
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%lftr.wideiv = trunc i64 %indvars.iv.next to i32
				%exitcond = icmp eq i32 %lftr.wideiv, %len
				br i1 %exitcond, label %for.cond.cleanup, label %for.body
				}

				; CHECK-LABEL: @add_g
				; CHECK: load <16 x i8>
				; CHECK: xor <16 x i8>
				; CHECK: icmp ult <16 x i8>
				; CHECK: select <16 x i1> {{.*}}, <16 x i8>
				; CHECK: store <16 x i8>
				define void @add_g(i8* noalias nocapture readonly %p, i8* noalias nocapture readonly %q, i8* noalias nocapture %r, i8 %arg1, i32 %len) #0 {
				%1 = icmp sgt i32 %len, 0
				br i1 %1, label %.lr.ph, label %._crit_edge

				.lr.ph: ; preds = %0
				%2 = sext i8 %arg1 to i64
				br label %3

				._crit_edge: ; preds = %3, %0
				ret void

				; <label>:3 ; preds = %3, %.lr.ph
				%indvars.iv = phi i64 [ 0, %.lr.ph ], [ %indvars.iv.next, %3 ]
				%x4 = getelementptr inbounds i8, i8* %p, i64 %indvars.iv
				%x5 = load i8, i8* %x4
				%x7 = getelementptr inbounds i8, i8* %q, i64 %indvars.iv
				%x8 = load i8, i8* %x7
				%x9 = zext i8 %x5 to i32
				%x10 = xor i32 %x9, 255
				%x11 = icmp ult i32 %x10, 24
				%x12 = select i1 %x11, i32 %x10, i32 24
				%x13 = trunc i32 %x12 to i8
				store i8 %x13, i8* %x4
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%lftr.wideiv = trunc i64 %indvars.iv.next to i32
				%exitcond = icmp eq i32 %lftr.wideiv, %len
				br i1 %exitcond, label %._crit_edge, label %3
				}

				attributes #0 = { nounwind }