This is an archive of the discontinued LLVM Phabricator instance.

Differential D18701

[ARM] Adding IEEE-754 SIMD detection to loop vectorizer
ClosedPublic

Authored by rengolin on Apr 1 2016, 10:19 AM.

Details

Reviewers

rengolin
mzolotukhin
aschwaighofer
jmolloy
resistor
hfinkel
ashutosh.nema

Summary

Some SIMD implementations are not IEEE-754 compliant, for example ARM's NEON.

This patch teaches the loop vectorizer to only allow transformations of loops
that either contain no floating-point operations or have enough allowance
flags supporting lack of precision (ex. -ffast-math, Darwin).

For that, the target description now has a method which tells us if the
vectorizer is allowed to handle FP math without falling into unsafe
representations, plus a check on every FP instruction in the candidate loop
to check for the safety flags.
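
The two-part check described above can be sketched in isolation. This is only an illustration under stated assumptions, not the patch itself: `InstInfo`, `TargetInfo`, `AllowsStrictFPInSIMD` and `mayVectorizeLoop` are hypothetical stand-ins for the target hook and the per-instruction scan, and no LLVM headers are involved.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for one instruction in the candidate loop.
struct InstInfo {
  bool IsFloatingPoint;  // does the instruction produce an FP value?
  bool HasUnsafeAlgebra; // does it carry fast-math style allowance flags?
};

// Hypothetical stand-in for the target description hook: true when the
// target's SIMD unit may handle FP math without extra allowance flags.
struct TargetInfo {
  bool AllowsStrictFPInSIMD;
};

// A loop stays vectorizable if the target imposes no restriction, or if
// every FP instruction in the loop carries the allowance flags.
inline bool mayVectorizeLoop(const std::vector<InstInfo> &Loop,
                             const TargetInfo &Target) {
  if (Target.AllowsStrictFPInSIMD)
    return true;
  for (const InstInfo &I : Loop)
    if (I.IsFloatingPoint && !I.HasUnsafeAlgebra)
      return false;
  return true;
}

// Example use with a restricted target (modelled on NEON as described above).
inline void exampleLoopGate() {
  TargetInfo Restricted{false};
  std::vector<InstInfo> IntOnly{{false, false}};
  std::vector<InstInfo> StrictFP{{true, false}};
  std::vector<InstInfo> FastFP{{true, true}};
  assert(mayVectorizeLoop(IntOnly, Restricted));
  assert(!mayVectorizeLoop(StrictFP, Restricted));
  assert(mayVectorizeLoop(FastFP, Restricted));
}
```

The sketch deliberately treats every FP instruction alike, which mirrors the conservative stance of the summary: one strict FP operation is enough to block the loop on a restricted target.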

This commit makes LLVM behave like GCC with respect to ARM NEON support, but
it stops short of fixing the underlying problem: sub-normals. Neither GCC
nor LLVM have a flag for allowing sub-normal operations. Before this patch,
GCC only allows it using unsafe-math flags and LLVM allows it by default with
no way to turn it off (short of not using NEON at all).
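
To make the sub-normal problem concrete, here is a small sketch of how a flush-to-zero (FTZ) unit can disagree with IEEE-754 scalar math. It is a model, not NEON code: `flushToZero`, `halveIEEE`, `halveFTZ` and `modelsDisagree` are hypothetical helpers, and the sketch assumes the host's scalar float arithmetic keeps sub-normals (the usual default when not compiling with fast-math options).

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Smallest positive normal float; halving it gives a sub-normal value.
constexpr float SmallestNormal = std::numeric_limits<float>::min();

// Hypothetical helper: models an FTZ unit by replacing any sub-normal
// value with a zero of the same sign.
inline float flushToZero(float X) {
  return std::fpclassify(X) == FP_SUBNORMAL ? std::copysign(0.0f, X) : X;
}

// Scalar IEEE-754 style halving: sub-normal results are kept.
inline float halveIEEE(float X) { return X * 0.5f; }

// Halving as the modelled FTZ unit sees it: operand and result are flushed.
inline float halveFTZ(float X) { return flushToZero(flushToZero(X) * 0.5f); }

// Returns true when the two models produce different results for X.
inline bool modelsDisagree(float X) {
  assert(std::isfinite(X) && "sketch only considers finite inputs");
  return halveIEEE(X) != halveFTZ(X);
}
```

Under these assumptions `modelsDisagree(SmallestNormal)` is true while `modelsDisagree(1.0f)` is false, which is the kind of scalar versus vector divergence the patch guards against.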

As a first step, we push this change to make it safe and in sync with GCC.
The second step is to discuss a new sub-normal flag in both communities
and come up with a common solution. The third step is to improve the FastMath
flags in LLVM to encode sub-normals and use those flags to restrict NEON FP.

Fixes PR16275.

Diff Detail

Event Timeline

rengolin updated this revision to Diff 52391. Apr 1 2016, 10:19 AM

rengolin retitled this revision from to [ARM] Adding IEEE-754 SIMD detection to loop vectorizer.

rengolin updated this object.

rengolin added reviewers: hfinkel, aschwaighofer, jmolloy.

rengolin set the repository for this revision to rL LLVM.

rengolin added subscribers: llvm-commits, grosser.

Herald added subscribers: mzolotukhin, rengolin, aemerson. Apr 1 2016, 10:19 AM

Forgot to modify tests to cope with ARMv8 as well as clean it up a bit.

rengolin added a reviewer: mzolotukhin. Apr 2 2016, 1:25 PM

rengolin added reviewers: resistor, ashutosh.nema.

ashutosh.nema added inline comments. Apr 5 2016, 4:27 AM

lib/Transforms/Vectorize/LoopVectorize.cpp
4717	You may want to set the unsafe flag for some of the intrinsics as well, i.e. intrinsics like minnum and maxnum are floating-point unsafe and follow IEEE-754 semantics.

rengolin added inline comments. Apr 5 2016, 5:17 AM

lib/Transforms/Vectorize/LoopVectorize.cpp
4717	Hum, that's a good point. But I don't want to be listing specific intrinsics if the list is not very short, or completely target independent. Maybe I should just be safe and do that for all FP operations except move, load and store?

ashutosh.nema added inline comments. Apr 5 2016, 6:59 AM

lib/Transforms/Vectorize/LoopVectorize.cpp
4717	Doing this for all operations except move, load and store would be conservative; it might miss eligible cases for vectorization. If we list, then the list should be of the safe intrinsics (those that can be vectorized), and all others would be marked with 'setPotentiallyUnsafe'. We also need to handle functions which are later going to be mapped to intrinsics (e.g. fmax, fmaxf). llvm::getIntrinsicIDForCall returns the intrinsic ID for a vectorizable CallInst; if it does not find one, it returns not_intrinsic. You could probably write a wrapper around this function and check safety. llvm::isTriviallyVectorizable has a list of vectorizable intrinsics, which might help with the listing.

rengolin added inline comments. Apr 5 2016, 7:54 AM

lib/Transforms/Vectorize/LoopVectorize.cpp
4717	Can you guarantee that whatever list you put up will be final? I'd rather be conservative and expect the user to explicitly specify -ffast-math or -fsubnormal-math than get it wrong in the future. There's no point in knowing which functions are vectorisable if we don't know which of them can be affected by the "unsafe" behaviour on each target. On ARM NEON, it's subnormals; on others it may be something else. When we get to the point of creating the subnormal flag, we'll also have to come up with a table for: is this instruction / call safe on this target? We don't need that for now, as we're assuming nothing is safe, just like GCC. It will inhibit eligible cases, and that's intentional.

Update from reviews:

Adding a FIXME comment to the vectorizer's check, make clear how big the hammer is
Being even more conservative, using all possible types of instructions (now including shifts and function calls, even those that will be builtins)
Changing the CHECK lines to use the legality message, not the final vectorization one, which can change over time and is irrelevant to the patch
Adding two new cases for fabs() with and without fast-math

Checked that the behaviour is consistent with GCC.

Hello,

I currently have an RFC for translating vector math intrinsics to svml calls. This proposal includes the user specifying the desired precision requirements via several flags (supported by the Intel compiler currently). The plan is to attach this information in the form of function attributes at the calls sites of the math intrinsics. In turn, these attributes drive the selection of the appropriate svml function variant. Would this be helpful in this particular case? I've included a description of the flags below.

-fimf-absolute-error=value[:funclist]

       define the maximum allowable absolute error for math library
       function results
         value    - a positive, floating-point number conforming to the
                    format [digits][.digits][{e|E}[sign]digits]
         funclist - optional comma separated list of one or more math
                    library functions to which the attribute should be
                    applied

-fimf-accuracy-bits=bits[:funclist]
       define the relative error, measured by the number of correct bits,
       for math library function results
         bits     - a positive, floating-point number
         funclist - optional comma separated list of one or more math
                    library functions to which the attribute should be
                    applied

-fimf-arch-consistency=value[:funclist]
       ensures that the math library functions produce consistent results
       across different implementations of the same architecture
         value    - true or false
         funclist - optional comma separated list of one or more math
                    library functions to which the attribute should be
                    applied

-fimf-max-error=ulps[:funclist]
       defines the maximum allowable relative error, measured in ulps, for
       math library function results
         ulps     - a positive, floating-point number conforming to the
                    format [digits][.digits][{e|E}[sign]digits]
         funclist - optional comma separated list of one or more math
                    library functions to which the attribute should be
                    applied

-fimf-precision=value[:funclist]
       defines the accuracy (precision) for math library functions
         value    - defined as one of the following values
                    high   - equivalent to max-error = 0.6
                    medium - equivalent to max-error = 4 (DEFAULT)
                    low    - equivalent to accuracy-bits = 11 (single
                             precision); accuracy-bits = 26 (double
                             precision)
         funclist - optional comma separated list of one or more math
                    library functions to which the attribute should be
                    applied

-fimf-domain-exclusion=classlist[:funclist]
       indicates the input arguments domain on which math functions
       must provide correct results.
        classlist - defined as one of the following values
                      nans, infinities, denormals, zeros
                      all, none, common
        funclist - optional list of one or more math library
                   functions to which the attribute should be applied.

Hi Matt,

I've seen your proposal, but I don't think it's relevant in this case. IEEE-754 compliance here is not about precision, but about behaviour.

Also, we already have flags to define most other behaviours from the standard, just not subnormals, so when we introduce the new flag, it'll be in the same group as all the others.

cheers,
--renato

@hfinkel, does this make sense? It's as close as I can get from GCC.

@grosser, is this what you expected?

This won't affect how we benchmark LLVM, since we already use -ffast-math when we don't care about precision anyway.

Forcing check-prefix=CHECK to make sure it gets the common case. I assumed it would by default, apparently, it doesn't.

Ping?

Yes, this is essentially what I had in mind. Two comments:

include/llvm/Analysis/TargetTransformInfo.h
368	The name of this method, and the corresponding comment, seem potentially confusing (no pun intended ;) ). It does not explain which floating-point operations are potentially unsafe (essentially all of them?). I think it would be much better to name this: bool isFPVectorizationPotentiallyUnsafe() const; perhaps with this comment: /// \brief Indicate that it is potentially unsafe to automatically vectorize floating-point operations because the semantics of vector and scalar floating-point semantics may differ. For example, ARM NEON v7 SIMD math does not support IEEE-754 denormal numbers, while depending on the platform, scalar floating-point math does.
lib/Transforms/Vectorize/LoopVectorize.cpp
4727	I don't understand why you have the `(CI || it->isBinaryOp() || it->isCast() || it->isShift())` - why does it matter what kind of operation it is? And do we have any floating-point shifts? If you specifically want to include casts, do you also specifically need to pick up loads and stores?

rengolin added inline comments. Apr 14 2016, 3:55 AM

include/llvm/Analysis/TargetTransformInfo.h
368	I see, and then invert the logic to return "false" by default, as in "it's not unsafe to transform this operation in this target", but on v7 NEON return true because it "is unsafe to do unsafe FP on NEON". I'll change that, as I really hated my original name. :)
lib/Transforms/Vectorize/LoopVectorize.cpp
4727	We don't want to get loads and stores because they're normally safe anyway. That's why I'm using the "else if". But also I don't want to get MOVs and shuffles either, since there's no chance they'll change the semantics or precision unless it's specifically requested (widening/shortening, type change, etc). So I just selected everything else. :) Function calls and binary operations are obvious. Casts may change the type, so they may end up as the case above. I don't know of any FP shifts, so that was probably over the top. I think it should be safe to do: } else if (it->getType()->isFloatingPointTy() && (CI || it->isBinaryOp()) && !it->hasUnsafeAlgebra()) {

hfinkel added inline comments. Apr 14 2016, 7:18 AM

lib/Transforms/Vectorize/LoopVectorize.cpp
4727	That makes sense to me. We should add an associated note to the declaration of isFPVectorizationPotentiallyUnsafe indicating this. Something like this perhaps: // This applies to floating-point math operations and calls, not memory operations, shuffles, or casts.
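
The narrowed condition agreed on above (FP result, call or binary operation, no fast-math flags) can be sketched as a standalone predicate. The `Kind` enum, the `Inst` struct and `marksLoopPotentiallyUnsafe` are hypothetical simplifications made up for this illustration; real LLVM instructions expose this information through their own API.

```cpp
#include <cassert>

// Hypothetical instruction categories; real LLVM IR has many more.
enum class Kind { Load, Store, Shuffle, Cast, BinaryOp, Call };

// Hypothetical summary of the properties the condition looks at.
struct Inst {
  Kind K;
  bool ProducesFP;       // result type is floating point
  bool HasUnsafeAlgebra; // fast-math flags are present
};

// Only FP calls and FP binary operations without fast-math flags mark the
// loop as potentially unsafe. Memory operations, shuffles and casts do not.
inline bool marksLoopPotentiallyUnsafe(const Inst &I) {
  const bool IsMathOp = I.K == Kind::Call || I.K == Kind::BinaryOp;
  return I.ProducesFP && IsMathOp && !I.HasUnsafeAlgebra;
}

// Example use covering the categories discussed in the thread.
inline void examplePredicate() {
  assert(marksLoopPotentiallyUnsafe({Kind::BinaryOp, true, false}));
  assert(marksLoopPotentiallyUnsafe({Kind::Call, true, false}));
  assert(!marksLoopPotentiallyUnsafe({Kind::BinaryOp, true, true}));
  assert(!marksLoopPotentiallyUnsafe({Kind::Load, true, false}));
  assert(!marksLoopPotentiallyUnsafe({Kind::Cast, true, false}));
}
```

Integer binary operations fall out naturally because `ProducesFP` is false for them, so no separate case is needed.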

rengolin added inline comments. Apr 14 2016, 7:21 AM

lib/Transforms/Vectorize/LoopVectorize.cpp
4727	Will do, thanks!

Thanks Hal,

I've applied the changes you suggested and committed as r266363.

I'll study a bit more on what would be a good subnormal flag.

--renato

This revision is now accepted and ready to land. Apr 14 2016, 1:50 PM

rengolin closed this revision. Apr 14 2016, 1:50 PM

Sorry to have missed this before commit.

lib/Target/ARM/ARMTargetTransformInfo.h
58	What's the idea here? The FTZ behaviour doesn't change in AArch32 execution state on ARMv8-A, you're still unsafe. The full IEEE-compliant Advanced SIMD is in AArch64 execution state.

rengolin added inline comments. Apr 18 2016, 3:17 AM

lib/Target/ARM/ARMTargetTransformInfo.h
58	Hum. I've been wrongly advised on this one... Now reading the ARM ARM, I can see that this is the case. I'll change it to match AArch64 only. Thanks!
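
One way to picture the correction discussed above is a hook keyed on execution state rather than on the FP architecture version. This is a guess at the shape of the follow-up, not the committed code: `ExecState`, `Subtarget` and this free-function form of `isFPVectorizationPotentiallyUnsafe` are hypothetical, and the Darwin exemption is taken from the summary and the diff under review.

```cpp
#include <cassert>

// Hypothetical model of the execution state the code is compiled for.
enum class ExecState { AArch32, AArch64 };

// Hypothetical subset of subtarget properties relevant to the decision.
struct Subtarget {
  ExecState State;
  bool IsDarwin;
};

// Returns true when vectorizing FP math may change results because the
// SIMD unit flushes sub-normals. AArch32 is treated as unsafe even on
// ARMv8-A hardware; Darwin opts out of the restriction.
inline bool isFPVectorizationPotentiallyUnsafe(const Subtarget &ST) {
  if (ST.State == ExecState::AArch64)
    return false;
  return !ST.IsDarwin;
}

// Example use for the three combinations mentioned in the review.
inline void exampleTargetHook() {
  assert(isFPVectorizationPotentiallyUnsafe({ExecState::AArch32, false}));
  assert(!isFPVectorizationPotentiallyUnsafe({ExecState::AArch32, true}));
  assert(!isFPVectorizationPotentiallyUnsafe({ExecState::AArch64, false}));
}
```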

Revision Contents

Path                                                       Size
include/llvm/Analysis/TargetTransformInfo.h                9 lines
include/llvm/Analysis/TargetTransformInfoImpl.h            2 lines
lib/Analysis/TargetTransformInfo.cpp                       4 lines
lib/Target/ARM/ARMTargetTransformInfo.h                    4 lines
lib/Transforms/Vectorize/LoopVectorize.cpp                 51 lines
test/Transforms/LoopVectorize/ARM/arm-ieee-vectorize.ll    335 lines

Diff 52780

include/llvm/Analysis/TargetTransformInfo.h

@@ (356 lines not shown) @@ public:
   bool shouldBuildLookupTables() const;

   /// \brief Don't restrict interleaved unrolling to small loops.
   bool enableAggressiveInterleaving(bool LoopHasReductions) const;

   /// \brief Enable matching of interleaved access groups.
   bool enableInterleavedAccessVectorization() const;

+  /// \brief Enable matching of potentially unsafe floating point math in SIMD.
+  /// This mainly requires relaxation of rules regarding IEEE-754 support
+  /// (like sub-normals in NEON v7) via unsafe math flags.
+  bool enablePotentiallyUnsafeFPVectorization() const;
+
   /// \brief Return hardware support for population count.
   PopcntSupportKind getPopcntSupport(unsigned IntTyWidthInBit) const;

   /// \brief Return true if the hardware has a fast square-root instruction.
   bool haveFastSqrt(Type *Ty) const;

   /// \brief Return the expected cost of supporting the floating point operation
   /// of the specified type.
@@ (238 lines not shown) @@ public:
   virtual bool isTruncateFree(Type *Ty1, Type *Ty2) = 0;
   virtual bool isProfitableToHoist(Instruction *I) = 0;
   virtual bool isTypeLegal(Type *Ty) = 0;
   virtual unsigned getJumpBufAlignment() = 0;
   virtual unsigned getJumpBufSize() = 0;
   virtual bool shouldBuildLookupTables() = 0;
   virtual bool enableAggressiveInterleaving(bool LoopHasReductions) = 0;
   virtual bool enableInterleavedAccessVectorization() = 0;
+  virtual bool enablePotentiallyUnsafeFPVectorization() = 0;
   virtual PopcntSupportKind getPopcntSupport(unsigned IntTyWidthInBit) = 0;
   virtual bool haveFastSqrt(Type *Ty) = 0;
   virtual int getFPOpCost(Type *Ty) = 0;
   virtual int getIntImmCost(const APInt &Imm, Type *Ty) = 0;
   virtual int getIntImmCost(unsigned Opc, unsigned Idx, const APInt &Imm,
                             Type *Ty) = 0;
   virtual int getIntImmCost(Intrinsic::ID IID, unsigned Idx, const APInt &Imm,
                             Type *Ty) = 0;
@@ (140 lines not shown) @@ bool shouldBuildLookupTables() override {
     return Impl.shouldBuildLookupTables();
   }
   bool enableAggressiveInterleaving(bool LoopHasReductions) override {
     return Impl.enableAggressiveInterleaving(LoopHasReductions);
   }
   bool enableInterleavedAccessVectorization() override {
     return Impl.enableInterleavedAccessVectorization();
   }
+  bool enablePotentiallyUnsafeFPVectorization() override {
+    return Impl.enablePotentiallyUnsafeFPVectorization();
+  }
   PopcntSupportKind getPopcntSupport(unsigned IntTyWidthInBit) override {
     return Impl.getPopcntSupport(IntTyWidthInBit);
   }
   bool haveFastSqrt(Type *Ty) override { return Impl.haveFastSqrt(Ty); }

   int getFPOpCost(Type *Ty) override { return Impl.getFPOpCost(Ty); }

   int getIntImmCost(const APInt &Imm, Type *Ty) override {
@@ (212 lines not shown) @@

include/llvm/Analysis/TargetTransformInfoImpl.h

@@ (234 lines not shown) @@ public:
   unsigned getJumpBufSize() { return 0; }

   bool shouldBuildLookupTables() { return true; }

   bool enableAggressiveInterleaving(bool LoopHasReductions) { return false; }

   bool enableInterleavedAccessVectorization() { return false; }

+  bool enablePotentiallyUnsafeFPVectorization() { return true; }
+
   TTI::PopcntSupportKind getPopcntSupport(unsigned IntTyWidthInBit) {
     return TTI::PSK_Software;
   }

   bool haveFastSqrt(Type *Ty) { return false; }

   unsigned getFPOpCost(Type *Ty) { return TargetTransformInfo::TCC_Basic; }

@@ (270 lines not shown) @@

lib/Analysis/TargetTransformInfo.cpp

@@ (166 lines not shown) @@
 bool TargetTransformInfo::enableAggressiveInterleaving(bool LoopHasReductions) const {
   return TTIImpl->enableAggressiveInterleaving(LoopHasReductions);
 }

 bool TargetTransformInfo::enableInterleavedAccessVectorization() const {
   return TTIImpl->enableInterleavedAccessVectorization();
 }

+bool TargetTransformInfo::enablePotentiallyUnsafeFPVectorization() const {
+  return TTIImpl->enablePotentiallyUnsafeFPVectorization();
+}
+
 TargetTransformInfo::PopcntSupportKind
 TargetTransformInfo::getPopcntSupport(unsigned IntTyWidthInBit) const {
   return TTIImpl->getPopcntSupport(IntTyWidthInBit);
 }

 bool TargetTransformInfo::haveFastSqrt(Type *Ty) const {
   return TTIImpl->haveFastSqrt(Ty);
 }
@@ (240 lines not shown) @@

lib/Target/ARM/ARMTargetTransformInfo.h

@@ (48 lines not shown) @@ public:
   ARMTTIImpl(const ARMTTIImpl &Arg)
       : BaseT(static_cast<const BaseT &>(Arg)), ST(Arg.ST), TLI(Arg.TLI) {}
   ARMTTIImpl(ARMTTIImpl &&Arg)
       : BaseT(std::move(static_cast<BaseT &>(Arg))), ST(std::move(Arg.ST)),
         TLI(std::move(Arg.TLI)) {}

   bool enableInterleavedAccessVectorization() { return true; }

+  bool enablePotentiallyUnsafeFPVectorization() {
+    return ST->hasFPARMv8() || ST->isTargetDarwin();
+  }
+
   /// \name Scalar TTI Implementations
   /// @{

   using BaseT::getIntImmCost;
   int getIntImmCost(const APInt &Imm, Type *Ty);

   /// @}

@@ (63 lines not shown) @@

lib/Transforms/Vectorize/LoopVectorize.cpp

	Show First 20 Lines • Show All 953 Lines • ▼ Show 20 Lines
	/// Vectorization interleave factor.			/// Vectorization interleave factor.
	Hint Interleave;			Hint Interleave;
	/// Vectorization forced			/// Vectorization forced
	Hint Force;			Hint Force;

	/// Return the loop metadata prefix.			/// Return the loop metadata prefix.
	static StringRef Prefix() { return "llvm.loop."; }			static StringRef Prefix() { return "llvm.loop."; }

				/// True if there is any unsafe math in the loop.
				bool PotentiallyUnsafe;

	public:			public:
	enum ForceKind {			enum ForceKind {
	FK_Undefined = -1, ///< Not selected.			FK_Undefined = -1, ///< Not selected.
	FK_Disabled = 0, ///< Forcing disabled.			FK_Disabled = 0, ///< Forcing disabled.
	FK_Enabled = 1, ///< Forcing enabled.			FK_Enabled = 1, ///< Forcing enabled.
	};			};

	LoopVectorizeHints(const Loop *L, bool DisableInterleaving)			LoopVectorizeHints(const Loop *L, bool DisableInterleaving)
	: Width("vectorize.width", VectorizerParams::VectorizationFactor,			: Width("vectorize.width", VectorizerParams::VectorizationFactor,
	HK_WIDTH),			HK_WIDTH),
	Interleave("interleave.count", DisableInterleaving, HK_UNROLL),			Interleave("interleave.count", DisableInterleaving, HK_UNROLL),
	Force("vectorize.enable", FK_Undefined, HK_FORCE),			Force("vectorize.enable", FK_Undefined, HK_FORCE),
	TheLoop(L) {			PotentiallyUnsafe(false), TheLoop(L) {
	// Populate values with existing loop metadata.			// Populate values with existing loop metadata.
	getHintsFromMetadata();			getHintsFromMetadata();

	// force-vector-interleave overrides DisableInterleaving.			// force-vector-interleave overrides DisableInterleaving.
	if (VectorizerParams::isInterleaveForced())			if (VectorizerParams::isInterleaveForced())
	Interleave.Value = VectorizerParams::VectorizationInterleave;			Interleave.Value = VectorizerParams::VectorizationInterleave;

	DEBUG(if (DisableInterleaving && Interleave.Value == 1) dbgs()			DEBUG(if (DisableInterleaving && Interleave.Value == 1) dbgs()
	▲ Show 20 Lines • Show All 81 Lines • ▼ Show 20 Lines
	// When enabling loop hints are provided we allow the vectorizer to change			// When enabling loop hints are provided we allow the vectorizer to change
	// the order of operations that is given by the scalar loop. This is not			// the order of operations that is given by the scalar loop. This is not
	// enabled by default because can be unsafe or inefficient. For example,			// enabled by default because can be unsafe or inefficient. For example,
	// reordering floating-point operations will change the way round-off			// reordering floating-point operations will change the way round-off
	// error accumulates in the loop.			// error accumulates in the loop.
	return getForce() == LoopVectorizeHints::FK_Enabled \|\| getWidth() > 1;			return getForce() == LoopVectorizeHints::FK_Enabled \|\| getWidth() > 1;
	}			}

				bool isPotentiallyUnsafe() const {
				// Avoid FP vectorization if the target is unsure about proper support.
				// This may be related to the SIMD unit in the target not handling
				// IEEE 754 FP ops properly, or bad single-to-double promotions.
				// Otherwise, a sequence of vectorized loops, even without reduction,
				// could lead to different end results on the destination vectors.
				return getForce() != LoopVectorizeHints::FK_Enabled && PotentiallyUnsafe;
				}

				void setPotentiallyUnsafe() {
				PotentiallyUnsafe = true;
				}

	private:			private:
	/// Find hints specified in the loop metadata and update local values.			/// Find hints specified in the loop metadata and update local values.
	void getHintsFromMetadata() {			void getHintsFromMetadata() {
	MDNode *LoopID = TheLoop->getLoopID();			MDNode *LoopID = TheLoop->getLoopID();
	if (!LoopID)			if (!LoopID)
	return;			return;

	// First operand should refer to the loop id itself.			// First operand should refer to the loop id itself.
	▲ Show 20 Lines • Show All 145 Lines • ▼ Show 20 Lines
	class LoopVectorizationLegality {			class LoopVectorizationLegality {
	public:			public:
	LoopVectorizationLegality(Loop *L, PredicatedScalarEvolution &PSE,			LoopVectorizationLegality(Loop *L, PredicatedScalarEvolution &PSE,
	DominatorTree DT, TargetLibraryInfo TLI,			DominatorTree DT, TargetLibraryInfo TLI,
	AliasAnalysis AA, Function F,			AliasAnalysis AA, Function F,
	const TargetTransformInfo *TTI,			const TargetTransformInfo *TTI,
	LoopAccessAnalysis *LAA,			LoopAccessAnalysis *LAA,
	LoopVectorizationRequirements *R,			LoopVectorizationRequirements *R,
	const LoopVectorizeHints *H)			LoopVectorizeHints *H)
	: NumPredStores(0), TheLoop(L), PSE(PSE), TLI(TLI), TheFunction(F),			: NumPredStores(0), TheLoop(L), PSE(PSE), TLI(TLI), TheFunction(F),
	TTI(TTI), DT(DT), LAA(LAA), LAI(nullptr), InterleaveInfo(PSE, L, DT),			TTI(TTI), DT(DT), LAA(LAA), LAI(nullptr), InterleaveInfo(PSE, L, DT),
	Induction(nullptr), WidestIndTy(nullptr), HasFunNoNaNAttr(false),			Induction(nullptr), WidestIndTy(nullptr), HasFunNoNaNAttr(false),
	Requirements(R), Hints(H) {}			Requirements(R), Hints(H) {}

	/// ReductionList contains the reduction descriptors for all			/// ReductionList contains the reduction descriptors for all
	/// of the reductions that were found in the loop.			/// of the reductions that were found in the loop.
	typedef DenseMap<PHINode *, RecurrenceDescriptor> ReductionList;			typedef DenseMap<PHINode *, RecurrenceDescriptor> ReductionList;
	▲ Show 20 Lines • Show All 209 Lines • ▼ Show 20 Lines

	/// Can we assume the absence of NaNs.			/// Can we assume the absence of NaNs.
	bool HasFunNoNaNAttr;			bool HasFunNoNaNAttr;

	/// Vectorization requirements that will go through late-evaluation.			/// Vectorization requirements that will go through late-evaluation.
	LoopVectorizationRequirements *Requirements;			LoopVectorizationRequirements *Requirements;

	/// Used to emit an analysis of any legality issues.			/// Used to emit an analysis of any legality issues.
	const LoopVectorizeHints *Hints;			LoopVectorizeHints *Hints;

	ValueToValueMap Strides;			ValueToValueMap Strides;
	SmallPtrSet<Value *, 8> StrideSet;			SmallPtrSet<Value *, 8> StrideSet;

	/// While vectorizing these instructions we have to generate a			/// While vectorizing these instructions we have to generate a
	/// call to the appropriate masked intrinsic			/// call to the appropriate masked intrinsic
	SmallPtrSet<const Instruction *, 8> MaskedOp;			SmallPtrSet<const Instruction *, 8> MaskedOp;
	};			};
	▲ Show 20 Lines • Show All 407 Lines • ▼ Show 20 Lines
	emitAnalysisDiag(			emitAnalysisDiag(
	F, L, Hints,			F, L, Hints,
	VectorizationReport()			VectorizationReport()
	<< "loop not vectorized due to NoImplicitFloat attribute");			<< "loop not vectorized due to NoImplicitFloat attribute");
	emitMissedWarning(F, L, Hints);			emitMissedWarning(F, L, Hints);
	return false;			return false;
	}			}

				// Check if the target supports potentially unsafe FP vectorization.
				// FIXME: Add a check for the type of safety issue (denormal, signaling)
				// for the target we're vectorizing for, to make sure none of the
				// additional fp-math flags can help.
				if (Hints.isPotentiallyUnsafe() &&
				!TTI->enablePotentiallyUnsafeFPVectorization()) {
				DEBUG(dbgs() << "LV: Potentially unsafe FP op prevents vectorization.\n");
				emitAnalysisDiag(
				F, L, Hints,
				VectorizationReport()
				<< "loop not vectorized due to unsafe FP support.");
				emitMissedWarning(F, L, Hints);
				return false;
				}

	// Select the optimal vectorization factor.			// Select the optimal vectorization factor.
	const LoopVectorizationCostModel::VectorizationFactor VF =			const LoopVectorizationCostModel::VectorizationFactor VF =
	CM.selectVectorizationFactor(OptForSize);			CM.selectVectorizationFactor(OptForSize);

	// Select the interleave count.			// Select the interleave count.
	unsigned IC = CM.selectInterleaveCount(OptForSize, VF.Width, VF.Cost);			unsigned IC = CM.selectInterleaveCount(OptForSize, VF.Width, VF.Cost);

	// Get user interleave count.			// Get user interleave count.
	▲ Show 20 Lines • Show All 1,982 Lines • ▼ Show 20 Lines
	Type *T = ST->getValueOperand()->getType();			Type *T = ST->getValueOperand()->getType();
	if (!VectorType::isValidElementType(T)) {			if (!VectorType::isValidElementType(T)) {
	emitAnalysis(VectorizationReport(ST) <<			emitAnalysis(VectorizationReport(ST) <<
	"store instruction cannot be vectorized");			"store instruction cannot be vectorized");
	return false;			return false;
	}			}
	if (EnableMemAccessVersioning)			if (EnableMemAccessVersioning)
	collectStridedAccess(ST);			collectStridedAccess(ST);
	}

	if (EnableMemAccessVersioning)			} else if (LoadInst *LI = dyn_cast<LoadInst>(it)) {
	if (LoadInst *LI = dyn_cast<LoadInst>(it))			if (EnableMemAccessVersioning)
	collectStridedAccess(LI);			collectStridedAccess(LI);

				// FP instructions can allow unsafe algebra, thus vectorizable by
				// non-IEEE-754 compliant SIMD units.
				} else if (it->getType()->isFloatingPointTy() &&
				ashutosh.nema: You may want to set the unsafe flag for some of the intrinsics as well, e.g. intrinsics like minnum and maxnum are floating-point unsafe and follow IEEE-754 semantics.
				rengolin: Hmm, that's a good point. But I don't want to be listing specific intrinsics if the list is not very short or completely target independent. Maybe I should just be safe and do that for all FP operations except move, load and store?
				ashutosh.nema: Doing this for all operations except move, load and store would be conservative and might miss eligible cases for vectorization. If we keep a list, it should be of the safe intrinsics (those that can be vectorized), and everything else gets marked with 'setPotentiallyUnsafe'. We also need to handle functions that are later mapped to intrinsics (e.g. fmax, fmaxf). llvm::getIntrinsicIDForCall returns the intrinsic ID for a vectorizable CallInst; if it does not find one, it returns not_intrinsic. You could probably write a wrapper around this function and check safety. llvm::isTriviallyVectorizable has a list of vectorizable intrinsics, which might help when building the list.
				rengolin: Can you guarantee that whatever list you put up will be final? I'd rather be conservative and expect the user to explicitly specify -ffast-math or -fsubnormal-math than get it wrong in the future. There's no point in knowing which functions are vectorisable if we don't know which of them can be affected by the "unsafe" behaviour on each target. On ARM NEON it's subnormals; on others it may be something else. When we get to the point of creating the subnormal flag, we'll also have to come up with a table for: is this instruction / call safe on this target? We don't need that for now, as we're assuming nothing is safe, just like GCC. It will inhibit eligible cases, and that's intentional.
				(CI || it->isBinaryOp() || it->isCast() || it->isShift()) &&
				!it->hasUnsafeAlgebra()) {
				DEBUG(dbgs() << "LV: Found FP op with unsafe algebra.\n");
				Hints->setPotentiallyUnsafe();
				}

	// Reduction instructions are allowed to have exit users.			// Reduction instructions are allowed to have exit users.
	// All other instructions must not have external users.			// All other instructions must not have external users.
	if (hasOutsideLoopUser(TheLoop, &*it, AllowedExit)) {			if (hasOutsideLoopUser(TheLoop, &*it, AllowedExit)) {
	emitAnalysis(VectorizationReport(&*it) <<			emitAnalysis(VectorizationReport(&*it) <<
				hfinkel: I don't understand why you have the `(CI || it->isBinaryOp() || it->isCast() || it->isShift())` check. Why does it matter what kind of operation it is? And do we have any floating-point shifts? If you specifically want to include casts, do you also specifically need to pick up loads and stores?
				rengolin: We don't want to pick up loads and stores because they're normally safe anyway. That's why I'm using the "else if". But I also don't want to pick up MOVs and shuffles, since there's no chance they'll change the semantics or precision unless it's specifically requested (widening/shortening, type change, etc). So I just selected everything else. :) Function calls and binary operations are obvious. Casts may change the type, so they may end up as the case above. I don't know of any FP shifts, so that was probably over the top. I think it should be safe to do: `} else if (it->getType()->isFloatingPointTy() && (CI || it->isBinaryOp()) && !it->hasUnsafeAlgebra()) {`
				hfinkel: That makes sense to me. We should add an associated note to the declaration of isFPVectorizationPotentiallyUnsafe indicating this. Something like this, perhaps: `// This applies to floating-point math operations and calls, not memory operations, shuffles, or casts.`
				rengolin: Will do, thanks!
	"value cannot be used outside the loop");			"value cannot be used outside the loop");
	return false;			return false;
	}			}

	} // next instr.			} // next instr.

	}			}
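
The gating logic discussed in the inline comments above can be modelled outside LLVM. The following is a minimal, self-contained sketch, not the patch itself: `OpKind`, `Inst`, `marksLoopPotentiallyUnsafe`, `isPotentiallyUnsafe` and `canVectorize` are hypothetical names invented for illustration, and the `TargetAllowsUnsafeFP` parameter merely stands in for the result of `TTI->enablePotentiallyUnsafeFPVectorization()`. It assumes the narrowed condition proposed in the review (FP calls and binary operations only).

```cpp
#include <vector>

// Hypothetical, simplified stand-ins for the instruction kinds that the
// review thread distinguishes. None of these types exist in LLVM.
enum class OpKind { Load, Store, Shuffle, Cast, BinaryOp, Call };

struct Inst {
  OpKind Kind;
  bool IsFloatingPoint;  // models it->getType()->isFloatingPointTy()
  bool HasUnsafeAlgebra; // models it->hasUnsafeAlgebra(), i.e. the 'fast' flag
};

// Assumed condition, following the narrowed form proposed in the review:
// only FP calls and FP binary operations without fast-math flags count.
bool marksLoopPotentiallyUnsafe(const Inst &I) {
  const bool IsMathOp = I.Kind == OpKind::Call || I.Kind == OpKind::BinaryOp;
  return I.IsFloatingPoint && IsMathOp && !I.HasUnsafeAlgebra;
}

// Models Hints->setPotentiallyUnsafe() followed by a later query of
// Hints.isPotentiallyUnsafe(): one offending instruction taints the loop.
bool isPotentiallyUnsafe(const std::vector<Inst> &Loop) {
  for (const Inst &I : Loop)
    if (marksLoopPotentiallyUnsafe(I))
      return true;
  return false;
}

// Models the early exit added to the vectorizer: a potentially unsafe loop
// is rejected unless the target says unsafe FP vectorization is acceptable.
bool canVectorize(const std::vector<Inst> &Loop, bool TargetAllowsUnsafeFP) {
  return !isPotentiallyUnsafe(Loop) || TargetAllowsUnsafeFP;
}
```

Under this model, a loop shaped like `sumf` in the test (two FP loads, one plain `fmul`, one store) is rejected when the target does not allow unsafe FP vectorization and accepted otherwise, while the same loop with `fmul fast` is accepted on both.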

	▲ Show 20 Lines • Show All 987 Lines • Show Last 20 Lines

test/Transforms/LoopVectorize/ARM/arm-ieee-vectorize.ll

This file was added.

				; RUN: opt -mtriple armv7-linux-gnueabihf -loop-vectorize -S %s -debug-only=loop-vectorize -o /dev/null 2>&1 | FileCheck %s --check-prefix=LINUX-V7
				; RUN: opt -mtriple armv8-linux-gnu -loop-vectorize -S %s -debug-only=loop-vectorize -o /dev/null 2>&1 | FileCheck %s --check-prefix=LINUX-V8
				; RUN: opt -mtriple armv7-unknown-darwin -loop-vectorize -S %s -debug-only=loop-vectorize -o /dev/null 2>&1 | FileCheck %s --check-prefix=DARWIN

				; Testing the ability of the loop vectorizer to tell when SIMD is safe or not
				; regarding IEEE 754 standard.
				; On Linux, we only want the vectorizer to work when -ffast-math flag is set,
				; because NEON is not IEEE compliant.
				; Darwin, on the other hand, doesn't support subnormals, and all optimizations
				; are allowed, even without -ffast-math.

				; Integer loops are always vectorizeable
				; CHECK: Checking a loop in "sumi"
				; CHECK: We can vectorize this loop!
				define void @sumi(i32* noalias nocapture readonly %A, i32* noalias nocapture readonly %B, i32* noalias nocapture %C, i32 %N) {
				entry:
				%cmp5 = icmp eq i32 %N, 0
				br i1 %cmp5, label %for.end, label %for.body.preheader

				for.body.preheader: ; preds = %entry
				br label %for.body

				for.body: ; preds = %for.body.preheader, %for.body
				%i.06 = phi i32 [ %inc, %for.body ], [ 0, %for.body.preheader ]
				%arrayidx = getelementptr inbounds i32, i32* %A, i32 %i.06
				%0 = load i32, i32* %arrayidx, align 4
				%arrayidx1 = getelementptr inbounds i32, i32* %B, i32 %i.06
				%1 = load i32, i32* %arrayidx1, align 4
				%mul = mul nsw i32 %1, %0
				%arrayidx2 = getelementptr inbounds i32, i32* %C, i32 %i.06
				store i32 %mul, i32* %arrayidx2, align 4
				%inc = add nuw nsw i32 %i.06, 1
				%exitcond = icmp eq i32 %inc, %N
				br i1 %exitcond, label %for.end.loopexit, label %for.body

				for.end.loopexit: ; preds = %for.body
				br label %for.end

				for.end: ; preds = %for.end.loopexit, %entry
				ret void
				}

				; Floating-point loops need fast-math to be vectorizeable
				; LINUX-V7: Checking a loop in "sumf"
				; LINUX-V7: Potentially unsafe FP op prevents vectorization
				; LINUX-V8: Checking a loop in "sumf"
				; LINUX-V8: We can vectorize this loop!
				; DARWIN: Checking a loop in "sumf"
				; DARWIN: We can vectorize this loop!
				define void @sumf(float* noalias nocapture readonly %A, float* noalias nocapture readonly %B, float* noalias nocapture %C, i32 %N) {
				entry:
				%cmp5 = icmp eq i32 %N, 0
				br i1 %cmp5, label %for.end, label %for.body.preheader

				for.body.preheader: ; preds = %entry
				br label %for.body

				for.body: ; preds = %for.body.preheader, %for.body
				%i.06 = phi i32 [ %inc, %for.body ], [ 0, %for.body.preheader ]
				%arrayidx = getelementptr inbounds float, float* %A, i32 %i.06
				%0 = load float, float* %arrayidx, align 4
				%arrayidx1 = getelementptr inbounds float, float* %B, i32 %i.06
				%1 = load float, float* %arrayidx1, align 4
				%mul = fmul float %0, %1
				%arrayidx2 = getelementptr inbounds float, float* %C, i32 %i.06
				store float %mul, float* %arrayidx2, align 4
				%inc = add nuw nsw i32 %i.06, 1
				%exitcond = icmp eq i32 %inc, %N
				br i1 %exitcond, label %for.end.loopexit, label %for.body

				for.end.loopexit: ; preds = %for.body
				br label %for.end

				for.end: ; preds = %for.end.loopexit, %entry
				ret void
				}

				; Integer loops are always vectorizeable
				; CHECK: Checking a loop in "redi"
				; CHECK: We can vectorize this loop!
				define i32 @redi(i32* noalias nocapture readonly %a, i32* noalias nocapture readonly %b, i32 %N) {
				entry:
				%cmp5 = icmp eq i32 %N, 0
				br i1 %cmp5, label %for.end, label %for.body.preheader

				for.body.preheader: ; preds = %entry
				br label %for.body

				for.body: ; preds = %for.body.preheader, %for.body
				%i.07 = phi i32 [ %inc, %for.body ], [ 0, %for.body.preheader ]
				%Red.06 = phi i32 [ %add, %for.body ], [ undef, %for.body.preheader ]
				%arrayidx = getelementptr inbounds i32, i32* %a, i32 %i.07
				%0 = load i32, i32* %arrayidx, align 4
				%arrayidx1 = getelementptr inbounds i32, i32* %b, i32 %i.07
				%1 = load i32, i32* %arrayidx1, align 4
				%mul = mul nsw i32 %1, %0
				%add = add nsw i32 %mul, %Red.06
				%inc = add nuw nsw i32 %i.07, 1
				%exitcond = icmp eq i32 %inc, %N
				br i1 %exitcond, label %for.end.loopexit, label %for.body

				for.end.loopexit: ; preds = %for.body
				%add.lcssa = phi i32 [ %add, %for.body ]
				br label %for.end

				for.end: ; preds = %for.end.loopexit, %entry
				%Red.0.lcssa = phi i32 [ undef, %entry ], [ %add.lcssa, %for.end.loopexit ]
				ret i32 %Red.0.lcssa
				}

				; Floating-point loops need fast-math to be vectorizeable
				; LINUX-V7: Checking a loop in "redf"
				; LINUX-V7: Potentially unsafe FP op prevents vectorization
				; LINUX-V8: Checking a loop in "redf"
				; LINUX-V8: We can vectorize this loop!
				; DARWIN: Checking a loop in "redf"
				; DARWIN: We can vectorize this loop!
				define float @redf(float* noalias nocapture readonly %a, float* noalias nocapture readonly %b, i32 %N) {
				entry:
				%cmp5 = icmp eq i32 %N, 0
				br i1 %cmp5, label %for.end, label %for.body.preheader

				for.body.preheader: ; preds = %entry
				br label %for.body

				for.body: ; preds = %for.body.preheader, %for.body
				%i.07 = phi i32 [ %inc, %for.body ], [ 0, %for.body.preheader ]
				%Red.06 = phi float [ %add, %for.body ], [ undef, %for.body.preheader ]
				%arrayidx = getelementptr inbounds float, float* %a, i32 %i.07
				%0 = load float, float* %arrayidx, align 4
				%arrayidx1 = getelementptr inbounds float, float* %b, i32 %i.07
				%1 = load float, float* %arrayidx1, align 4
				%mul = fmul float %0, %1
				%add = fadd float %Red.06, %mul
				%inc = add nuw nsw i32 %i.07, 1
				%exitcond = icmp eq i32 %inc, %N
				br i1 %exitcond, label %for.end.loopexit, label %for.body

				for.end.loopexit: ; preds = %for.body
				%add.lcssa = phi float [ %add, %for.body ]
				br label %for.end

				for.end: ; preds = %for.end.loopexit, %entry
				%Red.0.lcssa = phi float [ undef, %entry ], [ %add.lcssa, %for.end.loopexit ]
				ret float %Red.0.lcssa
				}

				; Make sure calls that turn into builtins are also covered
				; LINUX-V7: Checking a loop in "fabs"
				; LINUX-V7: Potentially unsafe FP op prevents vectorization
				; LINUX-V8: Checking a loop in "fabs"
				; LINUX-V8: We can vectorize this loop!
				; DARWIN: Checking a loop in "fabs"
				; DARWIN: We can vectorize this loop!
				define void @fabs(float* noalias nocapture readonly %A, float* noalias nocapture readonly %B, float* noalias nocapture %C, i32 %N) {
				entry:
				%cmp10 = icmp eq i32 %N, 0
				br i1 %cmp10, label %for.end, label %for.body

				for.body: ; preds = %entry, %for.body
				%i.011 = phi i32 [ %inc, %for.body ], [ 0, %entry ]
				%arrayidx = getelementptr inbounds float, float* %A, i32 %i.011
				%0 = load float, float* %arrayidx, align 4
				%arrayidx1 = getelementptr inbounds float, float* %B, i32 %i.011
				%1 = load float, float* %arrayidx1, align 4
				%fabsf = tail call float @fabsf(float %1) #1
				%conv3 = fmul float %0, %fabsf
				%arrayidx4 = getelementptr inbounds float, float* %C, i32 %i.011
				store float %conv3, float* %arrayidx4, align 4
				%inc = add nuw nsw i32 %i.011, 1
				%exitcond = icmp eq i32 %inc, %N
				br i1 %exitcond, label %for.end, label %for.body

				for.end: ; preds = %for.body, %entry
				ret void
				}

				; Integer loops are always vectorizeable
				; CHECK: Checking a loop in "sumi_fast"
				; CHECK: We can vectorize this loop!
				define void @sumi_fast(i32* noalias nocapture readonly %A, i32* noalias nocapture readonly %B, i32* noalias nocapture %C, i32 %N) {
				entry:
				%cmp5 = icmp eq i32 %N, 0
				br i1 %cmp5, label %for.end, label %for.body.preheader

				for.body.preheader: ; preds = %entry
				br label %for.body

				for.body: ; preds = %for.body.preheader, %for.body
				%i.06 = phi i32 [ %inc, %for.body ], [ 0, %for.body.preheader ]
				%arrayidx = getelementptr inbounds i32, i32* %A, i32 %i.06
				%0 = load i32, i32* %arrayidx, align 4
				%arrayidx1 = getelementptr inbounds i32, i32* %B, i32 %i.06
				%1 = load i32, i32* %arrayidx1, align 4
				%mul = mul nsw i32 %1, %0
				%arrayidx2 = getelementptr inbounds i32, i32* %C, i32 %i.06
				store i32 %mul, i32* %arrayidx2, align 4
				%inc = add nuw nsw i32 %i.06, 1
				%exitcond = icmp eq i32 %inc, %N
				br i1 %exitcond, label %for.end.loopexit, label %for.body

				for.end.loopexit: ; preds = %for.body
				br label %for.end

				for.end: ; preds = %for.end.loopexit, %entry
				ret void
				}

				; Floating-point loops can be vectorizeable with fast-math
				; CHECK: Checking a loop in "sumf_fast"
				; CHECK: We can vectorize this loop!
				define void @sumf_fast(float* noalias nocapture readonly %A, float* noalias nocapture readonly %B, float* noalias nocapture %C, i32 %N) {
				entry:
				%cmp5 = icmp eq i32 %N, 0
				br i1 %cmp5, label %for.end, label %for.body.preheader

				for.body.preheader: ; preds = %entry
				br label %for.body

				for.body: ; preds = %for.body.preheader, %for.body
				%i.06 = phi i32 [ %inc, %for.body ], [ 0, %for.body.preheader ]
				%arrayidx = getelementptr inbounds float, float* %A, i32 %i.06
				%0 = load float, float* %arrayidx, align 4
				%arrayidx1 = getelementptr inbounds float, float* %B, i32 %i.06
				%1 = load float, float* %arrayidx1, align 4
				%mul = fmul fast float %1, %0
				%arrayidx2 = getelementptr inbounds float, float* %C, i32 %i.06
				store float %mul, float* %arrayidx2, align 4
				%inc = add nuw nsw i32 %i.06, 1
				%exitcond = icmp eq i32 %inc, %N
				br i1 %exitcond, label %for.end.loopexit, label %for.body

				for.end.loopexit: ; preds = %for.body
				br label %for.end

				for.end: ; preds = %for.end.loopexit, %entry
				ret void
				}

				; Integer loops are always vectorizeable
				; CHECK: Checking a loop in "redi_fast"
				; CHECK: We can vectorize this loop!
				define i32 @redi_fast(i32* noalias nocapture readonly %a, i32* noalias nocapture readonly %b, i32 %N) {
				entry:
				%cmp5 = icmp eq i32 %N, 0
				br i1 %cmp5, label %for.end, label %for.body.preheader

				for.body.preheader: ; preds = %entry
				br label %for.body

				for.body: ; preds = %for.body.preheader, %for.body
				%i.07 = phi i32 [ %inc, %for.body ], [ 0, %for.body.preheader ]
				%Red.06 = phi i32 [ %add, %for.body ], [ undef, %for.body.preheader ]
				%arrayidx = getelementptr inbounds i32, i32* %a, i32 %i.07
				%0 = load i32, i32* %arrayidx, align 4
				%arrayidx1 = getelementptr inbounds i32, i32* %b, i32 %i.07
				%1 = load i32, i32* %arrayidx1, align 4
				%mul = mul nsw i32 %1, %0
				%add = add nsw i32 %mul, %Red.06
				%inc = add nuw nsw i32 %i.07, 1
				%exitcond = icmp eq i32 %inc, %N
				br i1 %exitcond, label %for.end.loopexit, label %for.body

				for.end.loopexit: ; preds = %for.body
				%add.lcssa = phi i32 [ %add, %for.body ]
				br label %for.end

				for.end: ; preds = %for.end.loopexit, %entry
				%Red.0.lcssa = phi i32 [ undef, %entry ], [ %add.lcssa, %for.end.loopexit ]
				ret i32 %Red.0.lcssa
				}

				; Floating-point loops can be vectorizeable with fast-math
				; CHECK: Checking a loop in "redf_fast"
				; CHECK: We can vectorize this loop!
				define float @redf_fast(float* noalias nocapture readonly %a, float* noalias nocapture readonly %b, i32 %N) {
				entry:
				%cmp5 = icmp eq i32 %N, 0
				br i1 %cmp5, label %for.end, label %for.body.preheader

				for.body.preheader: ; preds = %entry
				br label %for.body

				for.body: ; preds = %for.body.preheader, %for.body
				%i.07 = phi i32 [ %inc, %for.body ], [ 0, %for.body.preheader ]
				%Red.06 = phi float [ %add, %for.body ], [ undef, %for.body.preheader ]
				%arrayidx = getelementptr inbounds float, float* %a, i32 %i.07
				%0 = load float, float* %arrayidx, align 4
				%arrayidx1 = getelementptr inbounds float, float* %b, i32 %i.07
				%1 = load float, float* %arrayidx1, align 4
				%mul = fmul fast float %1, %0
				%add = fadd fast float %mul, %Red.06
				%inc = add nuw nsw i32 %i.07, 1
				%exitcond = icmp eq i32 %inc, %N
				br i1 %exitcond, label %for.end.loopexit, label %for.body

				for.end.loopexit: ; preds = %for.body
				%add.lcssa = phi float [ %add, %for.body ]
				br label %for.end

				for.end: ; preds = %for.end.loopexit, %entry
				%Red.0.lcssa = phi float [ undef, %entry ], [ %add.lcssa, %for.end.loopexit ]
				ret float %Red.0.lcssa
				}

				; Make sure calls that turn into builtins are also covered
				; CHECK: Checking a loop in "fabs_fast"
				; CHECK: We can vectorize this loop!
				define void @fabs_fast(float* noalias nocapture readonly %A, float* noalias nocapture readonly %B, float* noalias nocapture %C, i32 %N) {
				entry:
				%cmp10 = icmp eq i32 %N, 0
				br i1 %cmp10, label %for.end, label %for.body

				for.body: ; preds = %entry, %for.body
				%i.011 = phi i32 [ %inc, %for.body ], [ 0, %entry ]
				%arrayidx = getelementptr inbounds float, float* %A, i32 %i.011
				%0 = load float, float* %arrayidx, align 4
				%arrayidx1 = getelementptr inbounds float, float* %B, i32 %i.011
				%1 = load float, float* %arrayidx1, align 4
				%fabsf = tail call fast float @fabsf(float %1) #2
				%conv3 = fmul fast float %fabsf, %0
				%arrayidx4 = getelementptr inbounds float, float* %C, i32 %i.011
				store float %conv3, float* %arrayidx4, align 4
				%inc = add nuw nsw i32 %i.011, 1
				%exitcond = icmp eq i32 %inc, %N
				br i1 %exitcond, label %for.end, label %for.body

				for.end: ; preds = %for.body, %entry
				ret void
				}

				declare float @fabsf(float)

				attributes #1 = { nounwind readnone "disable-tail-calls"="false" "less-precise-fpmad"="false" "no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf" "no-infs-fp-math"="false" "no-nans-fp-math"="false" "stack-protector-buffer-size"="8" "target-cpu"="cortex-a8" "target-features"="+dsp,+neon,+vfp3" "unsafe-fp-math"="false" "use-soft-float"="false" }
				attributes #2 = { nounwind readnone "disable-tail-calls"="false" "less-precise-fpmad"="false" "no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf" "no-infs-fp-math"="true" "no-nans-fp-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="cortex-a8" "target-features"="+dsp,+neon,+vfp3" "unsafe-fp-math"="true" "use-soft-float"="false" }
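
To make the subnormal concern behind this test concrete, here is a small host-side sketch of how a flush-to-zero unit can disagree with IEEE-754 arithmetic. It is only a model, under the assumption that the non-compliance at issue is flush-to-zero of subnormal inputs and results: `flushToZero`, `mulIEEE` and `mulFlushing` are hypothetical helpers, not LLVM or NEON APIs, and the code assumes the host itself runs with subnormals enabled (that is, it is not built with fast-math style flags).

```cpp
#include <cmath>
#include <limits>

// Hypothetical model of a flush-to-zero unit: any subnormal value is
// replaced by a zero of the same sign; everything else passes through.
float flushToZero(float X) {
  return std::fpclassify(X) == FP_SUBNORMAL ? std::copysign(0.0f, X) : X;
}

// Reference behaviour: an ordinary IEEE-754 multiply on the host.
float mulIEEE(float A, float B) { return A * B; }

// Modelled behaviour: inputs and the result are flushed to zero.
float mulFlushing(float A, float B) {
  return flushToZero(flushToZero(A) * flushToZero(B));
}
```

For example, halving `std::numeric_limits<float>::min()` (the smallest normal float) gives a non-zero subnormal through `mulIEEE` but exactly zero through `mulFlushing`, whereas operands that stay in the normal range give identical results from both. This is the kind of divergence the vectorizer avoids by requiring fast-math flags before moving scalar FP code onto such a unit.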