This is an archive of the discontinued LLVM Phabricator instance.

[ARM] Adding IEEE-754 SIMD detection to loop vectorizer
AbandonedPublic

Authored by rengolin on Feb 11 2016, 7:14 AM.

Details

Reviewers

nadav
aschwaighofer
jmolloy
hfinkel

Summary

Some SIMD implementations are not IEEE-754 compliant, for example ARM's NEON.
This patch teaches the loop vectorizer to only allow transformations of loops
that either contain no floating-point operations or have enough allowance
flags supporting lack of precision, including -ffast-math and other NaN/Inf
related flags.

For that, the target description now has a method which tells us if the
SIMD unit is IEEE-754 compliant, and the vectorizer has a check on every
FP instruction in the candidate loop to check for the safety flags.

Fixes PR16275.
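The check described in the summary can be sketched as a small standalone model. This is only an illustration under stated assumptions: `Inst`, `TargetInfo`, and the free function `isPotentiallyUnsafe` are hypothetical stand-ins rather than LLVM types, and the way the two target properties are combined is a guess based on the discussion later in this review (the actual legality loop is not visible in the diff shown here).

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for an IR instruction; not an LLVM type.
struct Inst {
  bool IsFPMath;    // true for floating-point arithmetic (fadd, fmul, ...)
  bool HasFastMath; // true if the operation carries relaxing FP flags
};

// Hypothetical stand-in for the two target properties discussed in the review.
struct TargetInfo {
  bool SIMDIsIEEE754;     // models isSIMDIEEE754()
  bool SupportsSubnormal; // models supportsSubnormal()
};

// Returns true if vectorizing the loop could change FP results on this target.
// Assumption: a forced vectorization hint overrides the check, mirroring
// "getForce() != FK_Enabled && PotentiallyUnsafe" in the patch.
bool isPotentiallyUnsafe(const std::vector<Inst> &Loop, const TargetInfo &TTI,
                         bool Forced) {
  if (Forced)
    return false;
  // A compliant SIMD unit, or a platform that does not promise subnormals,
  // has nothing to lose by vectorizing.
  if (TTI.SIMDIsIEEE754 || !TTI.SupportsSubnormal)
    return false;
  // Otherwise every FP operation must have opted in to relaxed semantics.
  for (const Inst &I : Loop)
    if (I.IsFPMath && !I.HasFastMath)
      return true;
  return false;
}
```

With this model, a loop containing a strict FP operation is rejected only on a target whose SIMD unit is non-compliant and whose platform still expects subnormals; an integer-only loop, a fully relaxed loop, or a forced loop passes.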

Diff Detail

Event Timeline

rengolin updated this revision to Diff 47643.Feb 11 2016, 7:14 AM

rengolin retitled this revision from to [ARM] Adding IEEE-754 SIMD detection to loop vectorizer.

rengolin updated this object.

rengolin added reviewers: hfinkel, jmolloy, nadav, aschwaighofer.

rengolin set the repository for this revision to rL LLVM.

rengolin added subscribers: llvm-commits, kristof.beyls, aadg.

Herald added subscribers: mzolotukhin, rengolin, aemerson.Feb 11 2016, 7:14 AM

rengolin added inline comments.Feb 11 2016, 7:20 AM

include/llvm/Analysis/TargetTransformInfo.h
373	An alternative to this would be to have an enum: enum { VFP = 0x1, SIMD = 0x2 } IEEE754Support; and initialise all targets with 0x3, but ARM with 0x1. Then, to get the value, we do: IsSIMDIEEE = getIEEE754Support() & IEEE754Support::SIMD; But since this patch doesn't need that, I think we should do it later, when it's needed. Changes to the SLP vectorizer will probably need it, and I can change it when I get there.
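The enum alternative mentioned in this comment could look roughly like the sketch below. The enumerator names and the `getIEEE754Support()` query come from the comment; the two target classes are hypothetical and exist only to show the 0x3 versus 0x1 initialisation.

```cpp
#include <cassert>

// Bitmask of FP units that are IEEE 754 compliant (names from the comment).
enum IEEE754Support : unsigned {
  VFP = 0x1,  // scalar floating-point unit
  SIMD = 0x2, // vector unit
};

// Hypothetical generic target: both units compliant (0x3).
struct GenericTarget {
  virtual ~GenericTarget() = default;
  virtual unsigned getIEEE754Support() const { return VFP | SIMD; }
};

// Hypothetical ARM-like target: only the scalar unit is compliant (0x1).
struct ARMLikeTarget : GenericTarget {
  unsigned getIEEE754Support() const override { return VFP; }
};

// Queries one unit by masking, as the comment suggests.
bool isSIMDIEEE(const GenericTarget &T) {
  return (T.getIEEE754Support() & IEEE754Support::SIMD) != 0;
}

bool isVFPIEEE(const GenericTarget &T) {
  return (T.getIEEE754Support() & IEEE754Support::VFP) != 0;
}
```

The bitmask form lets a pass such as the SLP vectorizer ask about the scalar and vector units separately, which a single boolean cannot express.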

anemet added a subscriber: anemet.Feb 11 2016, 10:04 AM

Thanks Renato for working on this.

lib/Transforms/Vectorize/LoopVectorize.cpp
4333	During ReductionPHI identification, floating-point min/max is only handled when ‘no-nans-fp-math’ is ON. This condition probably needs to be modified:

  RecurrenceDescriptor::InstDesc
  RecurrenceDescriptor::isRecurrenceInstr(Instruction *I, RecurrenceKind Kind,
                                          InstDesc &Prev, bool HasFunNoNaNAttr) {
    case Instruction::FCmp:
    case Instruction::ICmp:
    case Instruction::Select:
      if (Kind != RK_IntegerMinMax &&
          (!HasFunNoNaNAttr || Kind != RK_FloatMinMax))
        return InstDesc(false, I);
      return isMinMaxSelectCmpPattern(I, Prev);

rengolin added inline comments.Feb 12 2016, 5:37 AM

lib/Transforms/Vectorize/LoopVectorize.cpp
4333	That would only stop reduction vectorization. The problem I'm trying to fix is not a vectorization issue, but an ISA one. NEON can't be used for safe-maths operations under any circumstances, so I need to make sure that no loop with FP ops ever gets vectorized without -ffast-math. That includes reductions and inductions. As you can see in my tests, I check that both behaviours are correct.

That's not correct for all platforms. Darwin, for example, uses NEON for floating point even in non-fast-math mode. The platform is simply defined not to support subnormals.

In D17141#351589, @resistor wrote:

That's not correct for all platforms. Darwin, for example, uses NEON for floating point even in non-fast-math mode. The platform is simply defined not to support subnormals.

Well, NEON denormal behaviour is independent of the OS running underneath, so this is the correct information. But it may not be the best decision for Darwin, and that's a separate story.

Can you always add a fast-math flag on Darwin builds in the front-end? Or maybe have another CC1 option to control SIMD use?

I don't think adding isDarwin() to the TTI makes any sense.

cheers,
--renato

In D17141#351600, @rengolin wrote:

In D17141#351589, @resistor wrote:

That's not correct for all platforms. Darwin, for example, uses NEON for floating point even in non-fast-math mode. The platform is simply defined not to support subnormals.

Well, NEON denormal behaviour is independent of the OS running underneath, so this is the correct information. But it may not be the best decision for Darwin, and that's a separate story.

It *does* depend on the OS running underneath, and I'm providing an example right here. This is comparable to the fact that many implementations do not provide sNaN support.

Can you always add a fast-math flag on Darwin builds in the front-end? Or maybe have another CC1 option to control SIMD use?

That's not an acceptable solution. Darwin AArch32 specifically documents that subnormals are not supported on the platform, but users can and do expect that code compiled without an explicit -ffast-math will respect other floating point properties, such as preventing floating point reassociation.

I don't think adding isDarwin() to the TTI makes any sense.

You can probably add a hook for "supportsSubnormals", but it'll end up being "!(isAArch32() && isDarwin())"

In D17141#351607, @resistor wrote:

It *does* depend on the OS running underneath, and I'm providing an example right here. This is comparable to the fact that many implementations do not provide sNaN support.

I'm not sure I understand. Are you talking about Apple's implementation of the SIMD unit, or about the OS setting up some different flags?

AFAIK, you can't "turn on" IEEE compliance on NEON with hardware flags. So, whether you accept non-IEEE-compliant code in Darwin (the OS?) becomes a platform decision, not a hardware requirement. This flag is about the hardware requirement.

That's not an acceptable solution. Darwin AArch32 specifically documents that subnormals are not supported on the platform, but users can and do expect that code compiled without an explicit -ffast-math will respect other floating point properties, such as preventing floating point reassociation.

Now I'm even more confused. The purpose of fast-math is to relax the FP model, not restrict it. I'm not relaxing more than before, I'm restricting it more.

If Darwin doesn't support subnormals, then how can it use NEON? How can it use the vectorizer as it is, which produces NEON instructions?

In D17141#351645, @rengolin wrote:

In D17141#351607, @resistor wrote:

It *does* depend on the OS running underneath, and I'm providing an example right here. This is comparable to the fact that many implementations do not provide sNaN support.

I'm not sure I understand. Are you talking about Apple's implementation of the SIMD unit, or about the OS setting up some different flags?

AFAIK, you can't "turn on" IEEE compliance on NEON with hardware flags. So, whether you accept non-IEEE-compliant code in Darwin (the OS?) becomes a platform decision, not a hardware requirement. This flag is about the hardware requirement.

Darwin, as a policy, maps all floating point onto the NEON unit. As you point out, this discards some elements of IEEE conformance, with subnormal support being the one that is typically brought up. Darwin documents this as the correct behavior of the platform, even without -ffast-math.

Now I'm even more confused. The purpose of fast-math is to relax the FP model, not restrict it. I'm not relaxing more than before, I'm restricting it more.

Yes, and restricting it in the way you propose will introduce significant regressions on Darwin. Darwin wants to perform floating point vectorization, even without -ffast-math, provided that the actual constraints like no reassociation are respected. In other words, the reasoning you present here that the NEON unit can only be used with -ffast-math is not true for Darwin.
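The "no reassociation" constraint mentioned here is separate from subnormal handling, and a small example shows why it matters. The sketch below (function names are illustrative only) evaluates the same three-term sum in two association orders. The `volatile` temporaries are there to force each intermediate to be rounded to single precision.

```cpp
#include <cassert>

// Evaluates (a + b) + c, the order written in the scalar source.
float sumLeftToRight(float a, float b, float c) {
  volatile float t = a + b;
  volatile float r = t + c;
  return r;
}

// Evaluates a + (b + c), an order a reassociating optimizer might pick.
float sumReassociated(float a, float b, float c) {
  volatile float t = b + c;
  volatile float r = a + t;
  return r;
}
```

For a = 1e8f, b = -1e8f, c = 1.0f, the first order gives 1.0f, while the second gives 0.0f because -1e8f + 1.0f rounds back to -1e8f in single precision. A vectorized reduction accumulates partial sums per lane, which amounts to this kind of reordering, so it needs explicit permission; element-wise vector arithmetic does not reorder anything.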

Darwin, as a policy, maps all floating point onto the NEON unit. As you point out, this discards some elements of IEEE conformance, with subnormal support being the one that is typically brought up. Darwin documents this as the correct behavior of the platform, even without -ffast-math.

Ok, a platform choice, not hardware support.

I understand that enabling fast-math will break other promises, so I agree that an extra subnormal property on the target, depending on the platform, is needed.

I'll work on that.

Thanks!
--renato

Adding supportsSubnormal() call to make sure we always optimise on Darwin.

hfinkel added inline comments.Feb 23 2016, 11:46 AM

include/llvm/Analysis/TargetTransformInfo.h
373	What does one thing have to do with the other (i.e. what does IEEE floating point have to do with allowing fast-math)? The underlying issue is that, without fast-math, the numeric representation, and the operations on numbers in that representation, should be the same. fast-math allows the use of alternate representations and operations (so long as they're not too different), but also allows reassociation. To allow vectorizing reductions, we need reassociation as well (which is a separate matter from the potential change in operational semantics).
lib/Target/ARM/ARMTargetTransformInfo.h
59	This comment seems misleading. "fast math" normally implies allowing things like reassociation (and lack of NaNs, etc.), and that has nothing to do with subnormals.
lib/Transforms/Vectorize/LoopVectorize.cpp
1859	This knowledge should live in the frontend. You can add a proper diagnostic information subclass, and then catch it in the frontend to produce front-end specific information.

resistor added inline comments.Feb 23 2016, 11:54 AM

include/llvm/Analysis/TargetTransformInfo.h
373	To pile on a bit, it's not just about SIMD. Darwin uses NEON for scalar floating point as well, rather than VFP.

rengolin added inline comments.Feb 23 2016, 12:49 PM

include/llvm/Analysis/TargetTransformInfo.h
373	I need to make this abundantly clear: this is not about reductions. Just because fast-math is used to free reductions doesn't mean it only has to do with reductions. The problem is that IEEE 754 states that any transformation has to have the same semantics as the original intention. This has to do with CSE, strength reduction, vectorization, inlining, etc. So, it's not just because we're not dealing with reductions that we don't care about IEEE compliance. The flag -ffast-math acts as a collection of flags related to rounding, zeroes, infinities, NaNs, etc. There are many ways to disable specific guarantees of the IEEE standard, but fast-math disables all of them. This is an exaggeration, but it's also safer. Reducing it to the most localised flag is an optimisation. Disabling it for the general case is a correctness change. Why SIMD? Because the loop vectorizer in particular only uses SIMD instructions, and this is a change in the loop vectorizer. There will be additional SLP changes, but this particular change is only related to the loop vectorizer. One thing at a time. It just happens that ARMv7's SIMD is not IEEE compliant, so I need to detect it and avoid vectorization. I'll need to do the same thing on the SLP vectorizer as well, and only allow VFP instructions. The loop vectorizer does not use VFP instructions, so we should be safe. Other alternatives were discussed (like vectorizing here, but scalarizing later), but that'll lead to bad predictions and likely bad performance and it's not worth the effort. Darwin can continue to use NEON for scalar FP without an issue, this is JUST for the loop vectorizer and will make no difference at all in Darwin. The newly added flags are just informational. The decisions are left for the passes to make.

rengolin added inline comments.Feb 23 2016, 12:51 PM

include/llvm/Analysis/TargetTransformInfo.h
373	> To pile on a bit, it's not just about SIMD. Darwin uses NEON for scalar floating point as well, rather than VFP.
To be specific, SIMD is the name of the hardware unit, not what you do with it, so this comment is not relevant to the discussion. ARMv7's SIMD will never be IEEE compliant, no matter what you do with it, and that's what the method is conveying.
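The non-compliance discussed throughout this thread is the handling of subnormals: the ARMv7 SIMD unit flushes them to zero, whereas IEEE 754 requires gradual underflow. The sketch below is a software model of that difference that can run on any IEEE host. The helper names are hypothetical, and the model covers only the flush-to-zero behaviour, not other aspects of NEON arithmetic.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Software model of flush-to-zero: a subnormal value is replaced by a zero
// of the same sign; every other value passes through unchanged.
float flushToZero(float x) {
  return std::fpclassify(x) == FP_SUBNORMAL ? std::copysign(0.0f, x) : x;
}

// Halves x with IEEE 754 semantics (gradual underflow into subnormals).
float halveIEEE(float x) { return x * 0.5f; }

// Halves x with modelled flush-to-zero semantics: both the input and the
// result are flushed.
float halveFTZ(float x) { return flushToZero(flushToZero(x) * 0.5f); }
```

Starting from `std::numeric_limits<float>::min()`, the smallest normal float, `halveIEEE` yields a non-zero subnormal while `halveFTZ` yields zero. The same scalar source can therefore produce different values depending on which unit executes it, which is why a loop containing strict FP operations cannot be moved onto such a unit silently.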

rengolin added inline comments.Feb 24 2016, 10:04 AM

lib/Transforms/Vectorize/LoopVectorize.cpp
1859	Which knowledge? AFAICS, SIMD IEEE compliance should not be in the front-end, and neither should Darwin's compliance. The way I'm handling it in the loop vectorizer is the same as with other restrictions, and it shouldn't affect SLP or other transformations, so moving this to the front-end will impact everything, not just this specific case.

hfinkel added inline comments.Feb 24 2016, 2:49 PM

lib/Transforms/Vectorize/LoopVectorize.cpp
1859	The knowledge of which front-end flags (-ffast-math) and pragmas can change the behavior belongs in the frontend. The messages we send from the backend are often strings, but can be message subclasses that the frontend can interpret to provide frontend-specific information along with that provided by the back end. (I'll happily provide more details, if you'd like, when I'm not working off my phone.)
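The split of responsibilities suggested here can be illustrated with a toy model. All class and function names below are hypothetical and are not LLVM's diagnostic API. The point is only that the backend reports what happened through a typed diagnostic, and the front end, which knows its own flags, decides what advice to attach.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical base class for diagnostics emitted by the backend.
class Diagnostic {
public:
  enum Kind { DK_Generic, DK_NonIEEESIMD };
  Diagnostic(Kind K, std::string Msg) : K(K), Msg(std::move(Msg)) {}
  virtual ~Diagnostic() = default;
  Kind getKind() const { return K; }
  const std::string &getMessage() const { return Msg; }

private:
  Kind K;
  std::string Msg;
};

// Hypothetical subclass: states the fact without naming any front-end flag.
class NonIEEESIMDDiagnostic : public Diagnostic {
public:
  NonIEEESIMDDiagnostic()
      : Diagnostic(DK_NonIEEESIMD,
                   "loop not vectorized: SIMD unit is not IEEE 754 compliant") {}
};

// A C-family front end recognises the kind and adds its own flag advice.
std::string renderForCFrontend(const Diagnostic &D) {
  std::string Out = D.getMessage();
  if (D.getKind() == Diagnostic::DK_NonIEEESIMD)
    Out += "; use -ffast-math if relaxed floating-point semantics are acceptable";
  return Out;
}
```

A front end for another language could handle the same diagnostic kind and mention a different option, or none at all, without any change to the vectorizer.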

rengolin added inline comments.Feb 25 2016, 9:39 AM

lib/Target/ARM/ARMTargetTransformInfo.h
59	Indeed, I'll remove this comment.
lib/Transforms/Vectorize/LoopVectorize.cpp
1859	Right, I think we're going off track here. I was trying to be helpful, but I can change it to just say the SIMD is not IEEE compatible and I can't vectorize unless the user requests floating-point relaxation, and leave -ffast-math as an exercise to the reader. Is that what you meant?

Removing fast-math comments and debug messages.

rengolin updated this revision to Diff 49083.Feb 25 2016, 9:51 AM

hfinkel added inline comments.Feb 26 2016, 6:30 AM

lib/Transforms/Vectorize/LoopVectorize.cpp
1859	SGTM

Ping

I'll work on a different approach.

Drive-by comment:

test/Transforms/LoopVectorize/ARM/arm-ieee-vectorize.ll
1	Do we really need O2 here?

rengolin added inline comments.Mar 23 2016, 12:54 PM

test/Transforms/LoopVectorize/ARM/arm-ieee-vectorize.ll
1	Not really, laziness on my part. I'll just add the necessary flags on my next iteration.

Revision Contents

Path                                                       Size
include/llvm/Analysis/TargetTransformInfo.h                13 lines
include/llvm/Analysis/TargetTransformInfoImpl.h            4 lines
lib/Analysis/TargetTransformInfo.cpp                       8 lines
lib/Target/ARM/ARMTargetTransformInfo.h                    4 lines
lib/Transforms/Vectorize/LoopVectorize.cpp                 50 lines
test/Transforms/LoopVectorize/ARM/arm-ieee-vectorize.ll    274 lines

Diff 49083

include/llvm/Analysis/TargetTransformInfo.h

Show First 20 Lines • Show All 362 Lines • ▼ Show 20 Lines	public:
   bool enableInterleavedAccessVectorization() const;

   /// \brief Return hardware support for population count.
   PopcntSupportKind getPopcntSupport(unsigned IntTyWidthInBit) const;

   /// \brief Return true if the hardware has a fast square-root instruction.
   bool haveFastSqrt(Type *Ty) const;

+  /// \brief Return true if SIMD is IEEE 754 compliant.
+  bool isSIMDIEEE754() const;
+
+  /// \brief Return true if the target implements floating point sub-normal handling,
+  /// ie. if it cares about it on a standard implementation level.
+  bool supportsSubnormal() const;
+
   /// \brief Return the expected cost of supporting the floating point operation
   /// of the specified type.
   int getFPOpCost(Type *Ty) const;

   /// \brief Return the expected cost of materializing for the given integer
   /// immediate of the specified type.
   int getIntImmCost(const APInt &Imm, Type *Ty) const;

▲ Show 20 Lines • Show All 224 Lines • ▼ Show 20 Lines	public:
   virtual bool isTypeLegal(Type *Ty) = 0;
   virtual unsigned getJumpBufAlignment() = 0;
   virtual unsigned getJumpBufSize() = 0;
   virtual bool shouldBuildLookupTables() = 0;
   virtual bool enableAggressiveInterleaving(bool LoopHasReductions) = 0;
   virtual bool enableInterleavedAccessVectorization() = 0;
   virtual PopcntSupportKind getPopcntSupport(unsigned IntTyWidthInBit) = 0;
   virtual bool haveFastSqrt(Type *Ty) = 0;
+  virtual bool isSIMDIEEE754() = 0;
+  virtual bool supportsSubnormal() = 0;
   virtual int getFPOpCost(Type *Ty) = 0;
   virtual int getIntImmCost(const APInt &Imm, Type *Ty) = 0;
   virtual int getIntImmCost(unsigned Opc, unsigned Idx, const APInt &Imm,
                             Type *Ty) = 0;
   virtual int getIntImmCost(Intrinsic::ID IID, unsigned Idx, const APInt &Imm,
                             Type *Ty) = 0;
   virtual unsigned getNumberOfRegisters(bool Vector) = 0;
   virtual unsigned getRegisterBitWidth(bool Vector) = 0;
▲ Show 20 Lines • Show All 141 Lines • ▼ Show 20 Lines	public:
   bool enableInterleavedAccessVectorization() override {
     return Impl.enableInterleavedAccessVectorization();
   }
   PopcntSupportKind getPopcntSupport(unsigned IntTyWidthInBit) override {
     return Impl.getPopcntSupport(IntTyWidthInBit);
   }
   bool haveFastSqrt(Type *Ty) override { return Impl.haveFastSqrt(Ty); }

+  bool isSIMDIEEE754() override { return Impl.isSIMDIEEE754(); }
+
+  bool supportsSubnormal() override { return Impl.supportsSubnormal(); }
+
   int getFPOpCost(Type *Ty) override { return Impl.getFPOpCost(Ty); }

   int getIntImmCost(const APInt &Imm, Type *Ty) override {
     return Impl.getIntImmCost(Imm, Ty);
   }
   int getIntImmCost(unsigned Opc, unsigned Idx, const APInt &Imm,
                     Type *Ty) override {
     return Impl.getIntImmCost(Opc, Idx, Imm, Ty);
▲ Show 20 Lines • Show All 206 Lines • Show Last 20 Lines

include/llvm/Analysis/TargetTransformInfoImpl.h

Show First 20 Lines • Show All 240 Lines • ▼ Show 20 Lines	public:
   bool enableInterleavedAccessVectorization() { return false; }

   TTI::PopcntSupportKind getPopcntSupport(unsigned IntTyWidthInBit) {
     return TTI::PSK_Software;
   }

   bool haveFastSqrt(Type *Ty) { return false; }

+  bool isSIMDIEEE754() { return true; }
+
+  bool supportsSubnormal() { return true; }
+
   unsigned getFPOpCost(Type *Ty) { return TargetTransformInfo::TCC_Basic; }

   unsigned getIntImmCost(const APInt &Imm, Type *Ty) { return TTI::TCC_Basic; }

   unsigned getIntImmCost(unsigned Opcode, unsigned Idx, const APInt &Imm,
                          Type *Ty) {
     return TTI::TCC_Free;
   }
▲ Show 20 Lines • Show All 260 Lines • Show Last 20 Lines

lib/Analysis/TargetTransformInfo.cpp

	Show First 20 Lines • Show All 175 Lines • ▼ Show 20 Lines
 TargetTransformInfo::getPopcntSupport(unsigned IntTyWidthInBit) const {
   return TTIImpl->getPopcntSupport(IntTyWidthInBit);
 }

 bool TargetTransformInfo::haveFastSqrt(Type *Ty) const {
   return TTIImpl->haveFastSqrt(Ty);
 }

+bool TargetTransformInfo::isSIMDIEEE754() const {
+  return TTIImpl->isSIMDIEEE754();
+}
+
+bool TargetTransformInfo::supportsSubnormal() const {
+  return TTIImpl->supportsSubnormal();
+}
+
 int TargetTransformInfo::getFPOpCost(Type *Ty) const {
   int Cost = TTIImpl->getFPOpCost(Ty);
   assert(Cost >= 0 && "TTI should not produce negative costs!");
   return Cost;
 }

 int TargetTransformInfo::getIntImmCost(const APInt &Imm, Type *Ty) const {
   int Cost = TTIImpl->getIntImmCost(Imm, Ty);
	▲ Show 20 Lines • Show All 223 Lines • Show Last 20 Lines

lib/Target/ARM/ARMTargetTransformInfo.h

Show First 20 Lines • Show All 48 Lines • ▼ Show 20 Lines	public:
   ARMTTIImpl(const ARMTTIImpl &Arg)
       : BaseT(static_cast<const BaseT &>(Arg)), ST(Arg.ST), TLI(Arg.TLI) {}
   ARMTTIImpl(ARMTTIImpl &&Arg)
       : BaseT(std::move(static_cast<BaseT &>(Arg))), ST(std::move(Arg.ST)),
         TLI(std::move(Arg.TLI)) {}

   bool enableInterleavedAccessVectorization() { return true; }

+  bool isSIMDIEEE754() { return false; }
+
+  bool supportsSubnormal() { return !ST->isTargetDarwin(); }
+
   /// \name Scalar TTI Implementations
   /// @{

   using BaseT::getIntImmCost;
   int getIntImmCost(const APInt &Imm, Type *Ty);

   /// @}

▲ Show 20 Lines • Show All 63 Lines • Show Last 20 Lines

lib/Transforms/Vectorize/LoopVectorize.cpp

	Show First 20 Lines • Show All 914 Lines • ▼ Show 20 Lines
   /// Vectorization interleave factor.
   Hint Interleave;
   /// Vectorization forced
   Hint Force;

   /// Return the loop metadata prefix.
   static StringRef Prefix() { return "llvm.loop."; }

+  /// True if there is any unsafe math in the loop.
+  bool PotentiallyUnsafe;
+
 public:
   enum ForceKind {
     FK_Undefined = -1, ///< Not selected.
     FK_Disabled = 0,   ///< Forcing disabled.
     FK_Enabled = 1,    ///< Forcing enabled.
   };

   LoopVectorizeHints(const Loop *L, bool DisableInterleaving)
       : Width("vectorize.width", VectorizerParams::VectorizationFactor,
               HK_WIDTH),
         Interleave("interleave.count", DisableInterleaving, HK_UNROLL),
         Force("vectorize.enable", FK_Undefined, HK_FORCE),
-        TheLoop(L) {
+        PotentiallyUnsafe(false), TheLoop(L) {
     // Populate values with existing loop metadata.
     getHintsFromMetadata();

     // force-vector-interleave overrides DisableInterleaving.
     if (VectorizerParams::isInterleaveForced())
       Interleave.Value = VectorizerParams::VectorizationInterleave;

     DEBUG(if (DisableInterleaving && Interleave.Value == 1) dbgs()
	▲ Show 20 Lines • Show All 81 Lines • ▼ Show 20 Lines
     // When enabling loop hints are provided we allow the vectorizer to change
     // the order of operations that is given by the scalar loop. This is not
     // enabled by default because can be unsafe or inefficient. For example,
     // reordering floating-point operations will change the way round-off
     // error accumulates in the loop.
     return getForce() == LoopVectorizeHints::FK_Enabled || getWidth() > 1;
   }

+  bool isPotentiallyUnsafe() const {
+    // Avoid FP vectorization if the target is unsure about proper support.
+    // This may be related to the SIMD unit in the target not handling
+    // IEEE 754 FP ops properly, or bad single-to-double promotions.
+    // Otherwise, a sequence of vectorized loops, even without reduction,
+    // could lead to different end results on the destination vectors.
+    return getForce() != LoopVectorizeHints::FK_Enabled && PotentiallyUnsafe;
+  }
+
+  void setPotentiallyUnsafe() {
+    PotentiallyUnsafe = true;
+  }
+
 private:
   /// Find hints specified in the loop metadata and update local values.
   void getHintsFromMetadata() {
     MDNode *LoopID = TheLoop->getLoopID();
     if (!LoopID)
       return;

     // First operand should refer to the loop id itself.
	▲ Show 20 Lines • Show All 145 Lines • ▼ Show 20 Lines
 class LoopVectorizationLegality {
 public:
   LoopVectorizationLegality(Loop *L, PredicatedScalarEvolution &PSE,
                             DominatorTree *DT, TargetLibraryInfo *TLI,
                             AliasAnalysis *AA, Function *F,
                             const TargetTransformInfo *TTI,
                             LoopAccessAnalysis *LAA,
                             LoopVectorizationRequirements *R,
-                            const LoopVectorizeHints *H)
+                            LoopVectorizeHints *H)
       : NumPredStores(0), TheLoop(L), PSE(PSE), TLI(TLI), TheFunction(F),
         TTI(TTI), DT(DT), LAA(LAA), LAI(nullptr), InterleaveInfo(PSE, L, DT),
         Induction(nullptr), WidestIndTy(nullptr), HasFunNoNaNAttr(false),
         Requirements(R), Hints(H) {}

   /// ReductionList contains the reduction descriptors for all
   /// of the reductions that were found in the loop.
   typedef DenseMap<PHINode *, RecurrenceDescriptor> ReductionList;
	[... 209 lines not shown ...]

	/// Can we assume the absence of NaNs.			/// Can we assume the absence of NaNs.
	bool HasFunNoNaNAttr;			bool HasFunNoNaNAttr;

	/// Vectorization requirements that will go through late-evaluation.			/// Vectorization requirements that will go through late-evaluation.
	LoopVectorizationRequirements *Requirements;			LoopVectorizationRequirements *Requirements;

	/// Used to emit an analysis of any legality issues.			/// Used to emit an analysis of any legality issues.
	const LoopVectorizeHints *Hints;			LoopVectorizeHints *Hints;

	ValueToValueMap Strides;			ValueToValueMap Strides;
	SmallPtrSet<Value *, 8> StrideSet;			SmallPtrSet<Value *, 8> StrideSet;

	/// While vectorizing these instructions we have to generate a			/// While vectorizing these instructions we have to generate a
	/// call to the appropriate masked intrinsic			/// call to the appropriate masked intrinsic
	SmallPtrSet<const Instruction *, 8> MaskedOp;			SmallPtrSet<const Instruction *, 8> MaskedOp;
	};			};
	[... 396 lines not shown ...]
	emitAnalysisDiag(			emitAnalysisDiag(
	F, L, Hints,			F, L, Hints,
	VectorizationReport()			VectorizationReport()
	<< "loop not vectorized due to NoImplicitFloat attribute");			<< "loop not vectorized due to NoImplicitFloat attribute");
	emitMissedWarning(F, L, Hints);			emitMissedWarning(F, L, Hints);
	return false;			return false;
	}			}

				// Check the target to see if SIMD is IEEE-754 compliant.
				if (Hints.isPotentiallyUnsafe() &&
				TTI->supportsSubnormal() &&
				!TTI->isSIMDIEEE754()) {
				DEBUG(dbgs() << "LV: Can't vectorize FP loops when target's SIMD is not "
				"IEEE-754 compliant.\n");
				emitAnalysisDiag(
				F, L, Hints,
				VectorizationReport()
				<< "non-IEEE-754 compliant SIMD for this target.");
				emitMissedWarning(F, L, Hints);
				hfinkel: This knowledge should live in the frontend. You can add a proper diagnostic information subclass, and then catch it in the frontend to produce front-end specific information.
				rengolin (author): Which knowledge? AFAICS, SIMD IEEE compliance should not be in the front-end, and neither should Darwin's compliance. The way I'm handling it in the loop vectorizer is the same as with other restrictions, and it shouldn't affect SLP or other transformations, so moving this to the front-end will impact everything, not just this specific case.
				hfinkel: The knowledge of which front-end flags (-ffast-math) and pragmas can change the behavior belongs in the frontend. The messages we send from the backend are often strings, but can be message subclasses that the frontend can interpret to provide frontend-specific information along with that provided by the back end. (I'll happily provide more details if you'd like when I'm not working off my phone.)
				rengolin (author): Right, I think we're going off track here. I was trying to be helpful, but I can change it to just say the SIMD is not IEEE compatible and I can't vectorize unless the user requests floating-point relaxation, and leave -ffast-math as an exercise to the reader. Is that what you meant?
				hfinkel: SGTM
				return false;
				}
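
hfinkel's suggestion in the thread above (a diagnostic subclass that the front end recognises and decorates) can be sketched roughly as follows. This is a standalone illustration, not LLVM's DiagnosticInfo API: Diagnostic, GenericDiagnostic, NonIEEESIMDDiagnostic and renderInFrontend are hypothetical names invented for the example, and dynamic_cast stands in for whatever kind-based dispatch a real implementation would use.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical base class for messages sent from the optimizer to the front end.
struct Diagnostic {
  virtual ~Diagnostic() = default;
  virtual std::string message() const = 0;
};

// Hypothetical plain-string diagnostic: the front end can only print it.
struct GenericDiagnostic : Diagnostic {
  std::string Text;
  explicit GenericDiagnostic(std::string T) : Text(std::move(T)) {}
  std::string message() const override { return Text; }
};

// Hypothetical subclass: states the fact and names no front-end flag.
struct NonIEEESIMDDiagnostic : Diagnostic {
  std::string message() const override {
    return "loop not vectorized: target SIMD is not IEEE-754 compliant";
  }
};

// Hypothetical front-end handler: only the front end knows which of its own
// flags relax floating-point semantics, so it appends that hint itself.
std::string renderInFrontend(const Diagnostic &D) {
  std::string Text = D.message();
  if (dynamic_cast<const NonIEEESIMDDiagnostic *>(&D) != nullptr)
    Text += "; relax floating-point semantics (e.g. -ffast-math) to allow it";
  return Text;
}
```

With this split, the message emitted by the vectorizer stays flag-agnostic, which matches the compromise reached at the end of the thread.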

	// Select the optimal vectorization factor.			// Select the optimal vectorization factor.
	const LoopVectorizationCostModel::VectorizationFactor VF =			const LoopVectorizationCostModel::VectorizationFactor VF =
	CM.selectVectorizationFactor(OptForSize);			CM.selectVectorizationFactor(OptForSize);

	// Select the interleave count.			// Select the interleave count.
	unsigned IC = CM.selectInterleaveCount(OptForSize, VF.Width, VF.Cost);			unsigned IC = CM.selectInterleaveCount(OptForSize, VF.Width, VF.Cost);

	// Get user interleave count.			// Get user interleave count.
	[... 1,662 lines not shown ...]
	<< "loop contains a switch statement");			<< "loop contains a switch statement");
	return false;			return false;
	}			}

	// We must be able to predicate all blocks that need to be predicated.			// We must be able to predicate all blocks that need to be predicated.
	if (blockNeedsPredication(BB)) {			if (blockNeedsPredication(BB)) {
	if (!blockCanBePredicated(BB, SafePointes)) {			if (!blockCanBePredicated(BB, SafePointes)) {
	emitAnalysis(VectorizationReport(BB->getTerminator())			emitAnalysis(VectorizationReport(BB->getTerminator())
	<< "control flow cannot be substituted for a select");			<< "control flow cannot be substituted for a select");
				ashutosh.nema: During ReductionPHI identification, floating-point min/max is only handled when ‘no-nans-fp-math’ is ON. This condition probably needs to be modified: RecurrenceDescriptor::InstDesc RecurrenceDescriptor::isRecurrenceInstr(Instruction *I, RecurrenceKind Kind, InstDesc &Prev, bool HasFunNoNaNAttr) { case Instruction::FCmp: case Instruction::ICmp: case Instruction::Select: if (Kind != RK_IntegerMinMax && (!HasFunNoNaNAttr || Kind != RK_FloatMinMax)) return InstDesc(false, I); return isMinMaxSelectCmpPattern(I, Prev);
				rengolin (author): That would only stop reduction vectorization. The problem I'm trying to fix is not a vectorization issue, but an ISA one. NEON can't be used for safe-maths operations under any circumstance, so I need to make sure that no loop with FP ops ever gets vectorized without -ffast-math. That includes reductions and inductions. As you can see in my tests, I check that both behaviours are correct.
	return false;			return false;
	}			}
	} else if (BB != Header && !canIfConvertPHINodes(BB)) {			} else if (BB != Header && !canIfConvertPHINodes(BB)) {
	emitAnalysis(VectorizationReport(BB->getTerminator())			emitAnalysis(VectorizationReport(BB->getTerminator())
	<< "control flow cannot be substituted for a select");			<< "control flow cannot be substituted for a select");
	return false;			return false;
	}			}
	}			}
	[... 303 lines not shown ...]
	Type *T = ST->getValueOperand()->getType();			Type *T = ST->getValueOperand()->getType();
	if (!VectorType::isValidElementType(T)) {			if (!VectorType::isValidElementType(T)) {
	emitAnalysis(VectorizationReport(ST) <<			emitAnalysis(VectorizationReport(ST) <<
	"store instruction cannot be vectorized");			"store instruction cannot be vectorized");
	return false;			return false;
	}			}
	if (EnableMemAccessVersioning)			if (EnableMemAccessVersioning)
	collectStridedAccess(ST);			collectStridedAccess(ST);
	}

	if (EnableMemAccessVersioning)			} else if (LoadInst *LI = dyn_cast<LoadInst>(it)) {
	if (LoadInst *LI = dyn_cast<LoadInst>(it))			if (EnableMemAccessVersioning)
	collectStridedAccess(LI);			collectStridedAccess(LI);

				// FP instructions can allow unsafe algebra, thus vectorizable by
				// non-IEEE-754 compliant SIMD units.
				} else if (it->getType()->isFloatingPointTy() &&
				(it->isBinaryOp() || it->isCast()) &&
				!it->hasUnsafeAlgebra()) {
				DEBUG(dbgs() << "LV: Found FP op with unsafe algebra.\n");
				Hints->setPotentiallyUnsafe();
				}

	// Reduction instructions are allowed to have exit users.			// Reduction instructions are allowed to have exit users.
	// All other instructions must not have external users.			// All other instructions must not have external users.
	if (hasOutsideLoopUser(TheLoop, &*it, AllowedExit)) {			if (hasOutsideLoopUser(TheLoop, &*it, AllowedExit)) {
	emitAnalysis(VectorizationReport(&*it) <<			emitAnalysis(VectorizationReport(&*it) <<
	"value cannot be used outside the loop");			"value cannot be used outside the loop");
	return false;			return false;
	}			}

	[... 991 lines not shown ...]
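
Taken together, the three hunks above (the isPotentiallyUnsafe()/setPotentiallyUnsafe() hint, the per-instruction FP scan, and the target check before the cost model) amount to the decision sketched below. This is a simplified standalone model, not the patch itself: FPOp, Target, Hints, scanLoop and mayVectorize are hypothetical stand-ins, and the Target fields only mirror the supportsSubnormal() and isSIMDIEEE754() queries that appear in the diff.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for an instruction in the candidate loop.
struct FPOp {
  bool IsFloatingPoint;  // result type is a floating-point type
  bool IsBinaryOrCast;   // binary operator or cast
  bool HasUnsafeAlgebra; // carries the 'fast' flag
};

// Hypothetical stand-in for the two target queries used in the diff.
struct Target {
  bool SupportsSubnormal;
  bool IsSIMDIEEE754;
};

// Hypothetical stand-in for LoopVectorizeHints.
struct Hints {
  bool ForceEnabled = false; // models FK_Enabled
  bool PotentiallyUnsafe = false;
  // A forced loop is never reported as unsafe, as in the diff.
  bool isPotentiallyUnsafe() const { return !ForceEnabled && PotentiallyUnsafe; }
};

// Legality scan: one strict (non-fast) FP binary op or cast flags the loop.
void scanLoop(const std::vector<FPOp> &Body, Hints &H) {
  for (const FPOp &Op : Body)
    if (Op.IsFloatingPoint && Op.IsBinaryOrCast && !Op.HasUnsafeAlgebra)
      H.PotentiallyUnsafe = true;
}

// Final decision: reject only when the loop is flagged, the target supports
// subnormals, and its SIMD unit is not IEEE-754 compliant.
bool mayVectorize(const std::vector<FPOp> &Body, Hints H, const Target &T) {
  scanLoop(Body, H);
  return !(H.isPotentiallyUnsafe() && T.SupportsSubnormal && !T.IsSIMDIEEE754);
}
```

Under this model a strict FP loop is rejected for a target with {SupportsSubnormal = true, IsSIMDIEEE754 = false}, while the same loop with fast flags, with a forced hint, or on a target that reports no subnormal support is allowed through. That is the behaviour the new test file expects for its Linux and Darwin runs.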

test/Transforms/LoopVectorize/ARM/arm-ieee-vectorize.ll

This file was added.

				; RUN: opt -O2 -loop-vectorize -S %s -debug-only=loop-vectorize -o /dev/null 2>&1 | FileCheck %s --check-prefix=LINUX
				mzolotukhin: Do we really need O2 here?
				rengolin (author): Not really, laziness on my part. I'll just add the necessary flags on my next iteration.
				; RUN: opt -mtriple armv7-unknown-darwin -O2 -loop-vectorize -S %s -debug-only=loop-vectorize -o /dev/null 2>&1 | FileCheck %s --check-prefix=DARWIN

				; Test the loop vectorizer's ability to tell whether SIMD is safe with
				; respect to the IEEE 754 standard.
				; On Linux, we only want the vectorizer to work when the -ffast-math flag
				; is set, because NEON is not IEEE compliant.
				; Darwin, on the other hand, doesn't support subnormals, so all
				; optimizations are allowed, even without -ffast-math.

				target datalayout = "e-m:e-p:32:32-i64:64-v128:64:128-a:0:32-n32-S64"
				target triple = "armv7--linux-gnueabihf"

				; Integer loops are always vectorizable
				; CHECK: Checking a loop in "sumi"
				; CHECK: Found a vectorizable loop (4) in
				define void @sumi(i32* noalias nocapture readonly %A, i32* noalias nocapture readonly %B, i32* noalias nocapture %C, i32 %N) {
				entry:
				%cmp5 = icmp eq i32 %N, 0
				br i1 %cmp5, label %for.end, label %for.body.preheader

				for.body.preheader: ; preds = %entry
				br label %for.body

				for.body: ; preds = %for.body.preheader, %for.body
				%i.06 = phi i32 [ %inc, %for.body ], [ 0, %for.body.preheader ]
				%arrayidx = getelementptr inbounds i32, i32* %A, i32 %i.06
				%0 = load i32, i32* %arrayidx, align 4
				%arrayidx1 = getelementptr inbounds i32, i32* %B, i32 %i.06
				%1 = load i32, i32* %arrayidx1, align 4
				%mul = mul nsw i32 %1, %0
				%arrayidx2 = getelementptr inbounds i32, i32* %C, i32 %i.06
				store i32 %mul, i32* %arrayidx2, align 4
				%inc = add nuw nsw i32 %i.06, 1
				%exitcond = icmp eq i32 %inc, %N
				br i1 %exitcond, label %for.end.loopexit, label %for.body

				for.end.loopexit: ; preds = %for.body
				br label %for.end

				for.end: ; preds = %for.end.loopexit, %entry
				ret void
				}

				; Floating-point loops need fast-math to be vectorizable
				; LINUX: Checking a loop in "sumf"
				; LINUX: Found FP op with unsafe algebra.
				; LINUX: Can't vectorize FP loops when target's SIMD is not IEEE-754 compliant.
				; DARWIN: Checking a loop in "sumf"
				; DARWIN: Found a vectorizable loop (4) in
				define void @sumf(float* noalias nocapture readonly %A, float* noalias nocapture readonly %B, float* noalias nocapture %C, i32 %N) {
				entry:
				%cmp5 = icmp eq i32 %N, 0
				br i1 %cmp5, label %for.end, label %for.body.preheader

				for.body.preheader: ; preds = %entry
				br label %for.body

				for.body: ; preds = %for.body.preheader, %for.body
				%i.06 = phi i32 [ %inc, %for.body ], [ 0, %for.body.preheader ]
				%arrayidx = getelementptr inbounds float, float* %A, i32 %i.06
				%0 = load float, float* %arrayidx, align 4
				%arrayidx1 = getelementptr inbounds float, float* %B, i32 %i.06
				%1 = load float, float* %arrayidx1, align 4
				%mul = fmul float %0, %1
				%arrayidx2 = getelementptr inbounds float, float* %C, i32 %i.06
				store float %mul, float* %arrayidx2, align 4
				%inc = add nuw nsw i32 %i.06, 1
				%exitcond = icmp eq i32 %inc, %N
				br i1 %exitcond, label %for.end.loopexit, label %for.body

				for.end.loopexit: ; preds = %for.body
				br label %for.end

				for.end: ; preds = %for.end.loopexit, %entry
				ret void
				}

				; Integer loops are always vectorizable
				; CHECK: Checking a loop in "redi"
				; CHECK: Found a vectorizable loop (4) in
				define i32 @redi(i32* noalias nocapture readonly %a, i32* noalias nocapture readonly %b, i32 %N) {
				entry:
				%cmp5 = icmp eq i32 %N, 0
				br i1 %cmp5, label %for.end, label %for.body.preheader

				for.body.preheader: ; preds = %entry
				br label %for.body

				for.body: ; preds = %for.body.preheader, %for.body
				%i.07 = phi i32 [ %inc, %for.body ], [ 0, %for.body.preheader ]
				%Red.06 = phi i32 [ %add, %for.body ], [ undef, %for.body.preheader ]
				%arrayidx = getelementptr inbounds i32, i32* %a, i32 %i.07
				%0 = load i32, i32* %arrayidx, align 4
				%arrayidx1 = getelementptr inbounds i32, i32* %b, i32 %i.07
				%1 = load i32, i32* %arrayidx1, align 4
				%mul = mul nsw i32 %1, %0
				%add = add nsw i32 %mul, %Red.06
				%inc = add nuw nsw i32 %i.07, 1
				%exitcond = icmp eq i32 %inc, %N
				br i1 %exitcond, label %for.end.loopexit, label %for.body

				for.end.loopexit: ; preds = %for.body
				%add.lcssa = phi i32 [ %add, %for.body ]
				br label %for.end

				for.end: ; preds = %for.end.loopexit, %entry
				%Red.0.lcssa = phi i32 [ undef, %entry ], [ %add.lcssa, %for.end.loopexit ]
				ret i32 %Red.0.lcssa
				}

				; Floating-point loops need fast-math to be vectorizable
				; LINUX: Checking a loop in "redf"
				; LINUX: Found FP op with unsafe algebra.
				; LINUX: Can't vectorize FP loops when target's SIMD is not IEEE-754 compliant.
				; DARWIN: Checking a loop in "redf"
				; DARWIN: Found a vectorizable loop (4) in
				define float @redf(float* noalias nocapture readonly %a, float* noalias nocapture readonly %b, i32 %N) {
				entry:
				%cmp5 = icmp eq i32 %N, 0
				br i1 %cmp5, label %for.end, label %for.body.preheader

				for.body.preheader: ; preds = %entry
				br label %for.body

				for.body: ; preds = %for.body.preheader, %for.body
				%i.07 = phi i32 [ %inc, %for.body ], [ 0, %for.body.preheader ]
				%Red.06 = phi float [ %add, %for.body ], [ undef, %for.body.preheader ]
				%arrayidx = getelementptr inbounds float, float* %a, i32 %i.07
				%0 = load float, float* %arrayidx, align 4
				%arrayidx1 = getelementptr inbounds float, float* %b, i32 %i.07
				%1 = load float, float* %arrayidx1, align 4
				%mul = fmul float %0, %1
				%add = fadd float %Red.06, %mul
				%inc = add nuw nsw i32 %i.07, 1
				%exitcond = icmp eq i32 %inc, %N
				br i1 %exitcond, label %for.end.loopexit, label %for.body

				for.end.loopexit: ; preds = %for.body
				%add.lcssa = phi float [ %add, %for.body ]
				br label %for.end

				for.end: ; preds = %for.end.loopexit, %entry
				%Red.0.lcssa = phi float [ undef, %entry ], [ %add.lcssa, %for.end.loopexit ]
				ret float %Red.0.lcssa
				}

				; Integer loops are always vectorizable
				; CHECK: Checking a loop in "sumi_fast"
				; CHECK: Found a vectorizable loop (4) in
				define void @sumi_fast(i32* noalias nocapture readonly %A, i32* noalias nocapture readonly %B, i32* noalias nocapture %C, i32 %N) {
				entry:
				%cmp5 = icmp eq i32 %N, 0
				br i1 %cmp5, label %for.end, label %for.body.preheader

				for.body.preheader: ; preds = %entry
				br label %for.body

				for.body: ; preds = %for.body.preheader, %for.body
				%i.06 = phi i32 [ %inc, %for.body ], [ 0, %for.body.preheader ]
				%arrayidx = getelementptr inbounds i32, i32* %A, i32 %i.06
				%0 = load i32, i32* %arrayidx, align 4
				%arrayidx1 = getelementptr inbounds i32, i32* %B, i32 %i.06
				%1 = load i32, i32* %arrayidx1, align 4
				%mul = mul nsw i32 %1, %0
				%arrayidx2 = getelementptr inbounds i32, i32* %C, i32 %i.06
				store i32 %mul, i32* %arrayidx2, align 4
				%inc = add nuw nsw i32 %i.06, 1
				%exitcond = icmp eq i32 %inc, %N
				br i1 %exitcond, label %for.end.loopexit, label %for.body

				for.end.loopexit: ; preds = %for.body
				br label %for.end

				for.end: ; preds = %for.end.loopexit, %entry
				ret void
				}

				; Floating-point loops can be vectorized with fast-math
				; CHECK: Checking a loop in "sumf_fast"
				; CHECK: Found a vectorizable loop (4) in
				define void @sumf_fast(float* noalias nocapture readonly %A, float* noalias nocapture readonly %B, float* noalias nocapture %C, i32 %N) {
				entry:
				%cmp5 = icmp eq i32 %N, 0
				br i1 %cmp5, label %for.end, label %for.body.preheader

				for.body.preheader: ; preds = %entry
				br label %for.body

				for.body: ; preds = %for.body.preheader, %for.body
				%i.06 = phi i32 [ %inc, %for.body ], [ 0, %for.body.preheader ]
				%arrayidx = getelementptr inbounds float, float* %A, i32 %i.06
				%0 = load float, float* %arrayidx, align 4
				%arrayidx1 = getelementptr inbounds float, float* %B, i32 %i.06
				%1 = load float, float* %arrayidx1, align 4
				%mul = fmul fast float %1, %0
				%arrayidx2 = getelementptr inbounds float, float* %C, i32 %i.06
				store float %mul, float* %arrayidx2, align 4
				%inc = add nuw nsw i32 %i.06, 1
				%exitcond = icmp eq i32 %inc, %N
				br i1 %exitcond, label %for.end.loopexit, label %for.body

				for.end.loopexit: ; preds = %for.body
				br label %for.end

				for.end: ; preds = %for.end.loopexit, %entry
				ret void
				}

				; Integer loops are always vectorizable
				; CHECK: Checking a loop in "redi_fast"
				; CHECK: Found a vectorizable loop (4) in
				define i32 @redi_fast(i32* noalias nocapture readonly %a, i32* noalias nocapture readonly %b, i32 %N) {
				entry:
				%cmp5 = icmp eq i32 %N, 0
				br i1 %cmp5, label %for.end, label %for.body.preheader

				for.body.preheader: ; preds = %entry
				br label %for.body

				for.body: ; preds = %for.body.preheader, %for.body
				%i.07 = phi i32 [ %inc, %for.body ], [ 0, %for.body.preheader ]
				%Red.06 = phi i32 [ %add, %for.body ], [ undef, %for.body.preheader ]
				%arrayidx = getelementptr inbounds i32, i32* %a, i32 %i.07
				%0 = load i32, i32* %arrayidx, align 4
				%arrayidx1 = getelementptr inbounds i32, i32* %b, i32 %i.07
				%1 = load i32, i32* %arrayidx1, align 4
				%mul = mul nsw i32 %1, %0
				%add = add nsw i32 %mul, %Red.06
				%inc = add nuw nsw i32 %i.07, 1
				%exitcond = icmp eq i32 %inc, %N
				br i1 %exitcond, label %for.end.loopexit, label %for.body

				for.end.loopexit: ; preds = %for.body
				%add.lcssa = phi i32 [ %add, %for.body ]
				br label %for.end

				for.end: ; preds = %for.end.loopexit, %entry
				%Red.0.lcssa = phi i32 [ undef, %entry ], [ %add.lcssa, %for.end.loopexit ]
				ret i32 %Red.0.lcssa
				}

				; Floating-point loops can be vectorized with fast-math
				; CHECK: Checking a loop in "redf_fast"
				; CHECK: Found a vectorizable loop (4) in
				define float @redf_fast(float* noalias nocapture readonly %a, float* noalias nocapture readonly %b, i32 %N) {
				entry:
				%cmp5 = icmp eq i32 %N, 0
				br i1 %cmp5, label %for.end, label %for.body.preheader

				for.body.preheader: ; preds = %entry
				br label %for.body

				for.body: ; preds = %for.body.preheader, %for.body
				%i.07 = phi i32 [ %inc, %for.body ], [ 0, %for.body.preheader ]
				%Red.06 = phi float [ %add, %for.body ], [ undef, %for.body.preheader ]
				%arrayidx = getelementptr inbounds float, float* %a, i32 %i.07
				%0 = load float, float* %arrayidx, align 4
				%arrayidx1 = getelementptr inbounds float, float* %b, i32 %i.07
				%1 = load float, float* %arrayidx1, align 4
				%mul = fmul fast float %1, %0
				%add = fadd fast float %mul, %Red.06
				%inc = add nuw nsw i32 %i.07, 1
				%exitcond = icmp eq i32 %inc, %N
				br i1 %exitcond, label %for.end.loopexit, label %for.body

				for.end.loopexit: ; preds = %for.body
				%add.lcssa = phi float [ %add, %for.body ]
				br label %for.end

				for.end: ; preds = %for.end.loopexit, %entry
				%Red.0.lcssa = phi float [ undef, %entry ], [ %add.lcssa, %for.end.loopexit ]
				ret float %Red.0.lcssa
				}
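
The header comment of this test says NEON is not IEEE compliant and ties the Darwin exception to subnormal support. One concrete way results can diverge is flush-to-zero arithmetic: ARMv7 NEON flushes subnormal values to zero, while a strict IEEE-754 scalar unit keeps them. The sketch below emulates that on any host; flushToZero and mulFTZ are hypothetical helpers written for this illustration and are not part of the patch or of LLVM.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical helper: replace a subnormal value with a zero of the same sign,
// which is what a flush-to-zero unit does.
float flushToZero(float X) {
  return std::fpclassify(X) == FP_SUBNORMAL ? std::copysign(0.0f, X) : X;
}

// Hypothetical model of a multiply on a flush-to-zero SIMD lane: both inputs
// and the result are flushed.
float mulFTZ(float A, float B) {
  return flushToZero(flushToZero(A) * flushToZero(B));
}
```

For example, multiplying the smallest normal float (2^-126) by 0.5 gives the subnormal 2^-127 under strict IEEE-754 rules, whereas mulFTZ returns 0 for the same inputs. A loop like @sumf that stores such products would therefore write different values to %C depending on whether it ran on the scalar or the vector unit, which is why the test expects the Linux run to refuse vectorization without fast-math flags.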