This is an archive of the discontinued LLVM Phabricator instance.

Differential D55373

[LSR] Generate formulae to enable more indexed accesses
ClosedPublic

Authored by samparker on Dec 6 2018, 8:08 AM.

Details

Reviewers

qcolombet
gilr
kparzysz

Commits

rG67756c09f21a: [LSR] Generate cross iteration indexes
rL353403: [LSR] Generate cross iteration indexes

Summary

Modify GenerateConstantOffsetsImpl to create offsets that can be used by indexed addressing modes. If formulae can be generated which result in the constant offset being the same size as the recurrence, we can generate an indexed access.

In the resulting code, at least for Arm, the pre-indexed load is usually the last access in the loop, but sometimes the first.
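As a rough illustration of the access shape the summary describes (a sketch only, not output of the patch), the two functions below compute the same sum. The names `sumBaseline`, `sumPreIndexed`, `Reg` and `Step` are made up for the example, and element indices stand in for addresses.

```cpp
#include <cstddef>
#include <cstdint>

// Baseline shape: the address is recomputed from the induction variable.
int32_t sumBaseline(const int32_t *Data, std::size_t N) {
  int32_t Sum = 0;
  for (std::size_t I = 0; I < N; ++I)
    Sum += Data[I];
  return Sum;
}

// Pre-indexed shape. Reg models the address register as an element index:
// it starts one step before the first element, and each access first adds
// the step (the constant offset) and then uses the updated value. A
// pre-indexed load with writeback performs both in one instruction.
int32_t sumPreIndexed(const int32_t *Data, std::size_t N) {
  const std::ptrdiff_t Step = 1;
  std::ptrdiff_t Reg = -Step; // start value is (first index - step)
  int32_t Sum = 0;
  for (std::size_t I = 0; I < N; ++I) {
    Reg += Step;      // writeback: the offset equals the recurrence step
    Sum += Data[Reg]; // access at the updated index
  }
  return Sum;
}
```

Because the constant offset and the recurrence step are the same value, the separate increment of the address register disappears into the access itself.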

@kparzysz Would you be able to provide feedback on how this affects Hexagon? It's a target that I don't build and I haven't looked at the tests, but I'm assuming this would interest you.

Diff Detail

Repository: rL LLVM

Event Timeline

samparker created this revision. Dec 6 2018, 8:08 AM

Herald added subscribers: kristof.beyls, javed.absar. Dec 6 2018, 8:08 AM

gilr added inline comments. Dec 9 2018, 8:49 AM

test/CodeGen/ARM/loop-align-cortex-m.ll
3 ↗	(On Diff #176977)	You're just fixing this by the way, right?
test/Transforms/LoopStrengthReduce/ARM/complexity.ll
10 ↗	(On Diff #176977)	Shouldn't -DEFAULT be removed here too?

dmgreen added a subscriber: dmgreen. Dec 10 2018, 3:26 AM

samparker marked 2 inline comments as done. Dec 11 2018, 12:51 AM

samparker added inline comments.

test/CodeGen/ARM/loop-align-cortex-m.ll
3 ↗	(On Diff #176977)	yes
test/Transforms/LoopStrengthReduce/ARM/complexity.ll
10 ↗	(On Diff #176977)	thanks!

I've run some tests and the results are not great for us. On some tests we got up to 5.5% improvements, but there are a lot of severe degradations (15+% worse). If this patch goes in, we'd like to be able to opt out.

Okay, thanks. We're also seeing some regressions, so I know I've got some tuning to do. Do you have any idea of the characteristics of your regressions? At the moment I'm thinking:

- The costs that I've added here are overly simplistic; for one, I think I need to add a setup cost.
- It's also probably not worth doing when we know that the loop iteration count is low.
- In the current state, we also see code size regressions, whereas your previous work helps us reduce code size. It may mean that I'll need a different flag to enable this change, but it may also be a symptom of the performance regressions.

gilr added inline comments. Dec 12 2018, 12:13 AM

lib/Transforms/Scalar/LoopStrengthReduce.cpp
3793 ↗	(On Diff #176977)	Is this relevant for non-Address kind formulae?
3793 ↗	(On Diff #176977)	Worth a comment here to describe the motivation and how adjusting the offset generates the post-inc opportunity.
3795 ↗	(On Diff #176977)	How does a non-constant step (e.g. 50 + %x) translate to post-inc? Could you add such a test case?

samparker marked 2 inline comments as done. Dec 12 2018, 1:09 AM

samparker added inline comments.

lib/Transforms/Scalar/LoopStrengthReduce.cpp
3793 ↗	(On Diff #176977)	No. I had naively assumed that this function was only generating formulae for addresses... so if that's not the case, I'll make the change here.
3795 ↗	(On Diff #176977)	Ah, good catch. This should only trigger for constant steps.

I've moved the logic under the control of a new TTI flag, as it seems that the current shouldFavorPostInc is trying to achieve different things. Hopefully I've also addressed Gil's comments.

gilr added inline comments. Dec 16 2018, 5:56 AM

lib/Transforms/Scalar/LoopStrengthReduce.cpp
1258 ↗	(On Diff #177872)	The existing `{LI, +, C}` pattern seems to already match this case (i.e. `{(-C + %a + ...), +, C}`) and the more general case `{((Offset - C2) + %a + ...), +, C2}`, where Offset != 0. So IIUC matching this pattern here is only needed if the new TTI flag is set but the existing one is reset, right? Won't this also match something like `{3, +, 5, +, %x}`?
1384 ↗	(On Diff #177872)	IIRC single-statement clauses shouldn't get curly braces.
3802 ↗	(On Diff #177872)	The new TTI API relates to the same HW feature, right? Why not use a cl::opt in LSR to turn this optimization off?
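For readers unfamiliar with the `{Start, +, Step}` notation used in these comments: an affine add recurrence takes the value Start + i * Step on iteration i. The sketch below is a hypothetical model (the `AffineAddRec` type and `accessAddress` helper are invented for illustration and are not LLVM APIs) showing why a start of `a - C` combined with a constant offset of `C` lines up with indexed addressing.

```cpp
#include <cstdint>

// Hypothetical model of an affine SCEV add recurrence {Start, +, Step}.
struct AffineAddRec {
  int64_t Start;
  int64_t Step;
  // Value of the recurrence on iteration I (I = 0 is the first iteration).
  int64_t at(int64_t I) const { return Start + I * Step; }
};

// Address used by an access that adds a constant offset to the register.
int64_t accessAddress(const AffineAddRec &Reg, int64_t Offset, int64_t I) {
  return Reg.at(I) + Offset;
}
```

With `Reg = {a - C, +, C}` and `Offset = C`, the address accessed on iteration i equals the value the register must hold on iteration i + 1, so the access and the increment can share one indexed instruction.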

samparker marked an inline comment as done. Dec 19 2018, 1:27 AM

samparker added inline comments.

lib/Transforms/Scalar/LoopStrengthReduce.cpp
1258 ↗	(On Diff #177872)	I'm going to need to replace this with something that compares the step with the base offset, aiming to have the post increment happen on the last, not the first, access. I need to make a few other changes to support this though as well.

samparker marked an inline comment as done. Dec 19 2018, 4:49 AM

samparker added inline comments.

lib/Transforms/Scalar/LoopStrengthReduce.cpp
3802 ↗	(On Diff #177872)	When I've got the next patch ready, I will re-run my tests and see if this is viable. It may not be good for all targets, and if so, I think a target hook would be more useful.

- Added a command line option, 'EnableBackedgePostIncs'.
- Added a command line option to enable narrowing the search space by collapsing unrolled code.
- The TTI hook now accepts the loop so that the target can make a more informed decision in 'shouldFavorBackedgePostIncs'.
- LSRInstance contains a boolean, 'FavorBackedgePostInc', which is equal to EnableBackedgePostIncs && shouldFavorBackedgePostIncs.

When FavorBackedgePostInc is set:
- Generate the new constant offsets.
- In RateRegister, the LoopCost is now set to 0 if the step recurrence is equal to the base offset of the parent formula.
- IsProfitableChain has a higher limit.
- The last expression in the IVChain is not added to IVIncSet, so it's a target for optimising.
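The gating and the RateRegister rule described in this update can be sketched in isolation as follows. This is a simplified model under stated assumptions, not the patch's code: `MockFormula`, `MockAddRec`, `favorBackedgePostInc` and `rateLoopCost` are hypothetical stand-ins for the LSR types and logic.

```cpp
#include <cstdint>

// Hypothetical stand-ins for LSR's Formula and a SCEV add recurrence.
struct MockFormula {
  int64_t BaseOffset; // constant offset folded into the addressing mode
};
struct MockAddRec {
  bool HasConstantStep; // only constant steps qualify
  int64_t Step;
};

// Mirrors "EnableBackedgePostIncs && shouldFavorBackedgePostIncs": both the
// command line option and the target hook have to agree.
bool favorBackedgePostInc(bool EnableOption, bool TargetHookResult) {
  return EnableOption && TargetHookResult;
}

// The register's per-iteration cost drops to 0 when its constant step
// equals the base offset of the parent formula; otherwise it stays at 1.
unsigned rateLoopCost(const MockFormula &F, const MockAddRec &AR, bool Favor) {
  unsigned LoopCost = 1;
  if (Favor && AR.HasConstantStep && AR.Step == F.BaseOffset)
    LoopCost = 0;
  return LoopCost;
}
```

The cost is only waived when the increment can be absorbed by the access itself, which is the case the new constant offsets are generated to create.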

samparker retitled this revision from [LSR] Generate formulae to enable more post-incs to [LSR] Generate formulae to enable more indexed accesses. Jan 11 2019, 7:22 AM

samparker edited the summary of this revision.

Herald added a subscriber: arphaman. Jan 11 2019, 7:22 AM

Some renaming: renamed the post-inc bits to 'index'.

Changed the default value for the command line option 'EnableBackedgeIndexing' to false.

Removed changes to the complexity.ll test.

samparker added a child revision: D56719: [DAGCombine] Enable more pre-indexed stores. Jan 15 2019, 6:57 AM

samparker removed a child revision: D56719: [DAGCombine] Enable more pre-indexed stores. Jan 16 2019, 6:30 AM

ping

In D55373#1367469, @samparker wrote:

ping

Are you still seeing regressions? Can you provide performance data with the current patch?

Hi Hal,

I've attached a graph to show geomean performance results of a popular embedded benchmark suite running on an Arm microcontroller. The suite is broken into five sub-suites, only three of which are affected by these changes. The large regression comes from a test which often exhibits bi-modal behaviour, so I could be getting unlucky. The whole suite comprises several dozen tests, and only four of the benchmarks have minor regressions across the three runs. There's also some backend work that I need to do for Arm to get this optimised properly.

gilr added inline comments. Jan 28 2019, 5:52 AM

lib/Transforms/Scalar/LoopStrengthReduce.cpp
159 ↗	(On Diff #181776)	False by default?
4466 ↗	(On Diff #181776)	The convention for flags controlling the narrowing heuristics seems to be to use them in NarrowSearchSpaceUsingHeuristics() rather than in the functions they affect.
4466 ↗	(On Diff #181776)	I think you meant '||' here.

samparker marked an inline comment as done. Jan 29 2019, 1:20 AM

samparker added inline comments.

lib/Transforms/Scalar/LoopStrengthReduce.cpp
159 ↗	(On Diff #181776)	This option has been introduced to force the collapsing, even if the complexity limit hasn't been reached. Which is why it's implemented differently to the other, similar, options.

gilr added inline comments. Feb 3 2019, 2:27 PM

lib/Transforms/Scalar/LoopStrengthReduce.cpp
159 ↗	(On Diff #181776)	Aaargh, sorry, read it backwards ... Is it needed since unrolled code doesn't initially go into the same LSRUse? If so, can FavorBackedgeIndex be used to have initial construction put them in the same LSRUse?
2904 ↗	(On Diff #181776)	Deserves a comment.
3121 ↗	(On Diff #181776)	I'm a bit confused here. IIUC a profitable complete chain provides an efficient solution using post-increments. Why break the last increment out?

samparker marked 2 inline comments as done. Feb 4 2019, 3:36 AM

samparker added inline comments.

lib/Transforms/Scalar/LoopStrengthReduce.cpp
159 ↗	(On Diff #181776)	I will look into this, thanks.
3121 ↗	(On Diff #181776)	This was so that the last access could also be a target for optimisation. I'll remove the change and do some testing again to see the difference.

Made some simplifications:

- Reset isProfitableChain.
- Reset FinalizeChain so that the tail is added to the chain again.
- Removed the CollapseUnrolled option because, with those changes reset, it wasn't really interesting.

I'll post the updated performance numbers.

gilr added inline comments. Feb 6 2019, 12:40 PM

include/llvm/Analysis/TargetTransformInfo.h
489 ↗	(On Diff #185268)	Please add a doxygen comment.
lib/Target/ARM/ARMTargetTransformInfo.h
97 ↗	(On Diff #185268)	Is this optimization inherently code-size unfriendly for ARM? (The patch actually reduces the instruction count in LSR's Cost when this optimization kicks in)
99 ↗	(On Diff #185268)	Is the single-block constraint due to CodeGen's single-block optimization scope? (If so, then IINM it's not target-specific)

samparker marked 2 inline comments as done. Feb 7 2019, 1:35 AM

samparker added inline comments.

lib/Target/ARM/ARMTargetTransformInfo.h
97 ↗	(On Diff #185268)	There are two reasons really: the transform is most useful in 'unrolled' loops (which we disable when optimising for code size), and this transform will introduce instructions into the preheader and, if the address can't be kept in the same register, we'll also produce moves. So this is mainly a defensive restriction, because I haven't been tracking code size, but I would hope that I can remove the restriction later.
99 ↗	(On Diff #185268)	No, it's not because of ISel restrictions or anything like that. It's because the transform is only likely to be useful if the address can be kept in the same register, which becomes increasingly less likely once multiple blocks are considered.
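The target heuristic being discussed can be modelled on its own as below. This is a sketch assuming mock types: `MockSubtarget`, `MockLoop` and `shouldFavorBackedgeIndexSketch` are hypothetical and only mirror the three conditions named in the review (not optimising for size, M-class Thumb2, single-block loop).

```cpp
// Hypothetical stand-ins for the Arm subtarget and the loop being queried.
struct MockSubtarget {
  bool IsMClass;
  bool IsThumb2;
};
struct MockLoop {
  unsigned NumBlocks;
  bool FunctionOptForSize; // the enclosing function is optimised for size
};

// Defensive restriction: never enable the transform when optimising for
// size, and otherwise only for single-block loops on M-class Thumb2 cores,
// where the address is most likely to stay in the same register.
bool shouldFavorBackedgeIndexSketch(const MockSubtarget &ST,
                                    const MockLoop &L) {
  if (L.FunctionOptForSize)
    return false;
  return ST.IsMClass && ST.IsThumb2 && L.NumBlocks == 1;
}
```

Keeping the decision in a per-target hook leaves room for a backend to substitute a different test, such as one based on register pressure, without touching LSR.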

Added doxygen comment.

gilr added inline comments. Feb 7 2019, 2:26 AM

lib/Target/ARM/ARMTargetTransformInfo.h
99 ↗	(On Diff #185268)	So both the code-size and single-block heuristics seem target-independent. Why not do this in LSR?

samparker marked an inline comment as done. Feb 7 2019, 2:58 AM

samparker added inline comments.

lib/Target/ARM/ARMTargetTransformInfo.h
99 ↗	(On Diff #185268)	I'd argue it's because different backends would come to different conclusions from mine. I've gone for a very simplistic heuristic, but it would be good to consider register pressure rather than just the number of blocks. Also, some targets may not support indexed accesses on certain types of memory operations, so it's worthless generating formulae for them.

LGTM!
(One last nitpick: I'd consider letting EnableBackedgeIndexing default to true as TTI's default already disables it for Hexagon and all other targets)

This revision is now accepted and ready to land. Feb 7 2019, 4:28 AM

Will do. Thanks for the review!

Closed by commit rL353403: [LSR] Generate cross iteration indexes (authored by sam_parker). Feb 7 2019, 5:33 AM

This revision was automatically updated to reflect the committed changes.

Herald added a project: Restricted Project. Feb 7 2019, 5:33 AM

Revision Contents

Path	Size
llvm/trunk/include/llvm/Analysis/TargetTransformInfo.h	8 lines
llvm/trunk/include/llvm/Analysis/TargetTransformInfoImpl.h	2 lines
llvm/trunk/lib/Analysis/TargetTransformInfo.cpp	4 lines
llvm/trunk/lib/Target/ARM/ARMTargetTransformInfo.h	6 lines
llvm/trunk/lib/Transforms/Scalar/LoopStrengthReduce.cpp	90 lines
llvm/trunk/test/CodeGen/ARM/dsp-loop-indexing.ll	310 lines
llvm/trunk/test/CodeGen/ARM/loop-align-cortex-m.ll	4 lines
llvm/trunk/test/CodeGen/ARM/loop-indexing.ll	1190 lines
llvm/trunk/test/Transforms/LoopStrengthReduce/ARM/complexity.ll	24 lines

Diff 185750

llvm/trunk/include/llvm/Analysis/TargetTransformInfo.h

[... 480 lines not shown; enclosing context: public: ...]
   /// Loop-strength-reduction (LSR) uses that knowledge to adjust its cost
   /// calculation for the instructions in a loop.
   bool canMacroFuseCmp() const;

   /// \return True is LSR should make efforts to create/preserve post-inc
   /// addressing mode expressions.
   bool shouldFavorPostInc() const;

+  /// Return true if LSR should make efforts to generate indexed addressing
+  /// modes that operate across loop iterations.
+  bool shouldFavorBackedgeIndex(const Loop *L) const;
+
   /// Return true if the target supports masked load/store
   /// AVX2 and AVX-512 targets allow masks for consecutive load and store
   bool isLegalMaskedStore(Type *DataType) const;
   bool isLegalMaskedLoad(Type *DataType) const;

   /// Return true if the target supports masked gather/scatter
   /// AVX-512 fully supports gather and scatter for vectors with 32 and 64
   /// bits scalar type.
[... 563 lines not shown; enclosing context: virtual bool isLegalAddressingMode(Type *Ty, GlobalValue *BaseGV, ...]
                                      int64_t BaseOffset, bool HasBaseReg,
                                      int64_t Scale,
                                      unsigned AddrSpace,
                                      Instruction *I) = 0;
   virtual bool isLSRCostLess(TargetTransformInfo::LSRCost &C1,
                              TargetTransformInfo::LSRCost &C2) = 0;
   virtual bool canMacroFuseCmp() = 0;
   virtual bool shouldFavorPostInc() const = 0;
+  virtual bool shouldFavorBackedgeIndex(const Loop *L) const = 0;
   virtual bool isLegalMaskedStore(Type *DataType) = 0;
   virtual bool isLegalMaskedLoad(Type *DataType) = 0;
   virtual bool isLegalMaskedScatter(Type *DataType) = 0;
   virtual bool isLegalMaskedGather(Type *DataType) = 0;
   virtual bool hasDivRemOp(Type *DataType, bool IsSigned) = 0;
   virtual bool hasVolatileVariant(Instruction *I, unsigned AddrSpace) = 0;
   virtual bool prefersVectorizedAddressing() = 0;
   virtual int getScalingFactorCost(Type *Ty, GlobalValue *BaseGV,
[... 220 lines not shown; enclosing context: bool isLSRCostLess(TargetTransformInfo::LSRCost &C1, ...]
     return Impl.isLSRCostLess(C1, C2);
   }
   bool canMacroFuseCmp() override {
     return Impl.canMacroFuseCmp();
   }
   bool shouldFavorPostInc() const override {
     return Impl.shouldFavorPostInc();
   }
+  bool shouldFavorBackedgeIndex(const Loop *L) const override {
+    return Impl.shouldFavorBackedgeIndex(L);
+  }
   bool isLegalMaskedStore(Type *DataType) override {
     return Impl.isLegalMaskedStore(DataType);
   }
   bool isLegalMaskedLoad(Type *DataType) override {
     return Impl.isLegalMaskedLoad(DataType);
   }
   bool isLegalMaskedScatter(Type *DataType) override {
     return Impl.isLegalMaskedScatter(DataType);
[... 412 lines not shown ...]

llvm/trunk/include/llvm/Analysis/TargetTransformInfoImpl.h

[... 247 lines not shown; enclosing context: return std::tie(C1.NumRegs, C1.AddRecCost, C1.NumIVMuls, C1.NumBaseAdds, ...]
            std::tie(C2.NumRegs, C2.AddRecCost, C2.NumIVMuls, C2.NumBaseAdds,
                     C2.ScaleCost, C2.ImmCost, C2.SetupCost);
   }

   bool canMacroFuseCmp() { return false; }

   bool shouldFavorPostInc() const { return false; }

+  bool shouldFavorBackedgeIndex(const Loop *L) const { return false; }
+
   bool isLegalMaskedStore(Type *DataType) { return false; }

   bool isLegalMaskedLoad(Type *DataType) { return false; }

   bool isLegalMaskedScatter(Type *DataType) { return false; }

   bool isLegalMaskedGather(Type *DataType) { return false; }

[... 603 lines not shown ...]

llvm/trunk/lib/Analysis/TargetTransformInfo.cpp

[... 156 lines not shown ...]
 bool TargetTransformInfo::canMacroFuseCmp() const {
   return TTIImpl->canMacroFuseCmp();
 }

 bool TargetTransformInfo::shouldFavorPostInc() const {
   return TTIImpl->shouldFavorPostInc();
 }

+bool TargetTransformInfo::shouldFavorBackedgeIndex(const Loop *L) const {
+  return TTIImpl->shouldFavorBackedgeIndex(L);
+}
+
 bool TargetTransformInfo::isLegalMaskedStore(Type *DataType) const {
   return TTIImpl->isLegalMaskedStore(DataType);
 }

 bool TargetTransformInfo::isLegalMaskedLoad(Type *DataType) const {
   return TTIImpl->isLegalMaskedLoad(DataType);
 }

[... 1,047 lines not shown ...]

llvm/trunk/lib/Target/ARM/ARMTargetTransformInfo.h

[... 87 lines not shown; enclosing context: explicit ARMTTIImpl(const ARMBaseTargetMachine *TM, const Function &F) ...]
       : BaseT(TM, F.getParent()->getDataLayout()), ST(TM->getSubtargetImpl(F)),
         TLI(ST->getTargetLowering()) {}

   bool areInlineCompatible(const Function *Caller,
                            const Function *Callee) const;

   bool enableInterleavedAccessVectorization() { return true; }

+  bool shouldFavorBackedgeIndex(const Loop *L) const {
+    if (L->getHeader()->getParent()->optForSize())
+      return false;
+    return ST->isMClass() && ST->isThumb2() && L->getNumBlocks() == 1;
+  }
+
   /// Floating-point computation using ARMv8 AArch32 Advanced
   /// SIMD instructions remains unchanged from ARMv7. Only AArch64 SIMD
   /// is IEEE-754 compliant, but it's not covered in this target.
   bool isFPVectorizationPotentiallyUnsafe() {
     return !ST->isTargetDarwin();
   }

   /// \name Scalar TTI Implementations
[... 89 lines not shown ...]

llvm/trunk/lib/Transforms/Scalar/LoopStrengthReduce.cpp

[... 148 lines not shown ...]

 // Flag to narrow search space by filtering non-optimal formulae with
 // the same ScaledReg and Scale.
 static cl::opt<bool> FilterSameScaledReg(
     "lsr-filter-same-scaled-reg", cl::Hidden, cl::init(true),
     cl::desc("Narrow LSR search space by filtering non-optimal formulae"
              " with the same ScaledReg and Scale"));

+static cl::opt<bool> EnableBackedgeIndexing(
+    "lsr-backedge-indexing", cl::Hidden, cl::init(true),
+    cl::desc("Enable the generation of cross iteration indexed memops"));
+
 static cl::opt<unsigned> ComplexityLimit(
     "lsr-complexity-limit", cl::Hidden,
     cl::init(std::numeric_limits<uint16_t>::max()),
     cl::desc("LSR search space complexity limit"));

 #ifndef NDEBUG
 // Stress test IV chain generation.
 static cl::opt<bool> StressIVChain(
[... 882 lines not shown; enclosing context: void RateFormula(const TargetTransformInfo &TTI, ...]
                    ScalarEvolution &SE, DominatorTree &DT,
                    const LSRUse &LU,
                    SmallPtrSetImpl<const SCEV *> *LoserRegs = nullptr);

   void print(raw_ostream &OS) const;
   void dump() const;

 private:
-  void RateRegister(const SCEV *Reg,
+  void RateRegister(const Formula &F, const SCEV *Reg,
                     SmallPtrSetImpl<const SCEV *> &Regs,
                     const Loop *L,
                     ScalarEvolution &SE, DominatorTree &DT,
                     const TargetTransformInfo &TTI);
-  void RatePrimaryRegister(const SCEV *Reg,
+  void RatePrimaryRegister(const Formula &F, const SCEV *Reg,
                            SmallPtrSetImpl<const SCEV *> &Regs,
                            const Loop *L,
                            ScalarEvolution &SE, DominatorTree &DT,
                            SmallPtrSetImpl<const SCEV *> *LoserRegs,
                            const TargetTransformInfo &TTI);
 };

 /// An operand value in an instruction which is to be replaced with some
[... 134 lines not shown ...]

 static bool isAMCompletelyFolded(const TargetTransformInfo &TTI,
                                  LSRUse::KindType Kind, MemAccessTy AccessTy,
                                  GlobalValue *BaseGV, int64_t BaseOffset,
                                  bool HasBaseReg, int64_t Scale,
                                  Instruction *Fixup = nullptr);

 /// Tally up interesting quantities from the given register.
-void Cost::RateRegister(const SCEV *Reg,
+void Cost::RateRegister(const Formula &F, const SCEV *Reg,
                         SmallPtrSetImpl<const SCEV *> &Regs,
                         const Loop *L,
                         ScalarEvolution &SE, DominatorTree &DT,
                         const TargetTransformInfo &TTI) {
   if (const SCEVAddRecExpr *AR = dyn_cast<SCEVAddRecExpr>(Reg)) {
     // If this is an addrec for another loop, it should be an invariant
     // with respect to L since L is the innermost loop (at least
     // for now LSR only handles innermost loops).
[... 10 lines not shown; enclosing context: if (AR->getLoop() != L) { ...]
       }

       // Otherwise, it will be an invariant with respect to Loop L.
       ++C.NumRegs;
       return;
     }

     unsigned LoopCost = 1;
+    if (TTI.isIndexedLoadLegal(TTI.MIM_PostInc, AR->getType()) ||
+        TTI.isIndexedStoreLegal(TTI.MIM_PostInc, AR->getType())) {
+
+      // If the step size matches the base offset, we could use pre-indexed
+      // addressing.
+      if (TTI.shouldFavorBackedgeIndex(L)) {
+        if (auto *Step = dyn_cast<SCEVConstant>(AR->getStepRecurrence(SE)))
+          if (Step->getAPInt() == F.BaseOffset)
+            LoopCost = 0;
+      }
+
       if (TTI.shouldFavorPostInc()) {
         const SCEV *LoopStep = AR->getStepRecurrence(SE);
         if (isa<SCEVConstant>(LoopStep)) {
-          // Check if a post-indexed load/store can be used.
-          if (TTI.isIndexedLoadLegal(TTI.MIM_PostInc, AR->getType()) ||
-              TTI.isIndexedStoreLegal(TTI.MIM_PostInc, AR->getType())) {
           const SCEV *LoopStart = AR->getStart();
           if (!isa<SCEVConstant>(LoopStart) &&
               SE.isLoopInvariant(LoopStart, L))
             LoopCost = 0;
         }
       }
     }
     C.AddRecCost += LoopCost;

     // Add the step value register, if it needs one.
     // TODO: The non-affine case isn't precisely modeled here.
     if (!AR->isAffine() || !isa<SCEVConstant>(AR->getOperand(1))) {
       if (!Regs.count(AR->getOperand(1))) {
-        RateRegister(AR->getOperand(1), Regs, L, SE, DT, TTI);
+        RateRegister(F, AR->getOperand(1), Regs, L, SE, DT, TTI);
         if (isLoser())
           return;
       }
     }
   }
   ++C.NumRegs;

   // Rough heuristic; favor registers which don't require extra setup
   // instructions in the preheader.
   if (!isa<SCEVUnknown>(Reg) &&
       !isa<SCEVConstant>(Reg) &&
       !(isa<SCEVAddRecExpr>(Reg) &&
         (isa<SCEVUnknown>(cast<SCEVAddRecExpr>(Reg)->getStart()) ||
          isa<SCEVConstant>(cast<SCEVAddRecExpr>(Reg)->getStart()))))
     ++C.SetupCost;

   C.NumIVMuls += isa<SCEVMulExpr>(Reg) &&
                  SE.hasComputableLoopEvolution(Reg, L);
 }

 /// Record this register in the set. If we haven't seen it before, rate
 /// it. Optional LoserRegs provides a way to declare any formula that refers to
 /// one of those regs an instant loser.
-void Cost::RatePrimaryRegister(const SCEV *Reg,
+void Cost::RatePrimaryRegister(const Formula &F, const SCEV *Reg,
                                SmallPtrSetImpl<const SCEV *> &Regs,
                                const Loop *L,
                                ScalarEvolution &SE, DominatorTree &DT,
                                SmallPtrSetImpl<const SCEV *> *LoserRegs,
                                const TargetTransformInfo &TTI) {
   if (LoserRegs && LoserRegs->count(Reg)) {
     Lose();
     return;
   }
   if (Regs.insert(Reg).second) {
-    RateRegister(Reg, Regs, L, SE, DT, TTI);
+    RateRegister(F, Reg, Regs, L, SE, DT, TTI);
     if (LoserRegs && isLoser())
       LoserRegs->insert(Reg);
   }
 }

 void Cost::RateFormula(const TargetTransformInfo &TTI,
                        const Formula &F,
                        SmallPtrSetImpl<const SCEV *> &Regs,
                        const DenseSet<const SCEV *> &VisitedRegs,
                        const Loop *L,
                        ScalarEvolution &SE, DominatorTree &DT,
                        const LSRUse &LU,
                        SmallPtrSetImpl<const SCEV *> *LoserRegs) {
   assert(F.isCanonical(*L) && "Cost is accurate only for canonical formula");
   // Tally up the registers.
   unsigned PrevAddRecCost = C.AddRecCost;
   unsigned PrevNumRegs = C.NumRegs;
   unsigned PrevNumBaseAdds = C.NumBaseAdds;
   if (const SCEV *ScaledReg = F.ScaledReg) {
     if (VisitedRegs.count(ScaledReg)) {
       Lose();
       return;
     }
-    RatePrimaryRegister(ScaledReg, Regs, L, SE, DT, LoserRegs, TTI);
+    RatePrimaryRegister(F, ScaledReg, Regs, L, SE, DT, LoserRegs, TTI);
     if (isLoser())
       return;
   }
   for (const SCEV *BaseReg : F.BaseRegs) {
     if (VisitedRegs.count(BaseReg)) {
       Lose();
       return;
     }
-    RatePrimaryRegister(BaseReg, Regs, L, SE, DT, LoserRegs, TTI);
+    RatePrimaryRegister(F, BaseReg, Regs, L, SE, DT, LoserRegs, TTI);
     if (isLoser())
       return;
   }

   // Determine how many (unfolded) adds we'll need inside the loop.
   size_t NumBaseParts = F.getNumRegs();
   if (NumBaseParts > 1)
     // Do not count the base and a possible second register if the target
[... 550 lines not shown ...]
 /// This class holds state for the main loop strength reduction logic.
 class LSRInstance {
   IVUsers &IU;
   ScalarEvolution &SE;
   DominatorTree &DT;
   LoopInfo &LI;
   const TargetTransformInfo &TTI;
   Loop *const L;
+  bool FavorBackedgeIndex = false;
   bool Changed = false;

   /// This is the insert position that the current loop's induction variable
   /// increment should be placed. In simple loops, this is the latch block's
   /// terminator. But in more complicated cases, this is a position which will
   /// dominate all the in-loop post-increment users.
   Instruction *IVIncInsertPos = nullptr;

... (898 lines not shown) ...
///		///
/// Chaining IVs can lead to considerable code bloat if ISEL doesn't		/// Chaining IVs can lead to considerable code bloat if ISEL doesn't
/// effectively use postinc addressing modes. Only consider it profitable if the		/// effectively use postinc addressing modes. Only consider it profitable if the
/// increments can be computed in fewer registers when chained.		/// increments can be computed in fewer registers when chained.
///		///
/// TODO: Consider IVInc free if it's already used in another chain.		/// TODO: Consider IVInc free if it's already used in another chain.
static bool		static bool
isProfitableChain(IVChain &Chain, SmallPtrSetImpl<Instruction*> &Users,		isProfitableChain(IVChain &Chain, SmallPtrSetImpl<Instruction*> &Users,
ScalarEvolution &SE, const TargetTransformInfo &TTI) {		ScalarEvolution &SE) {
if (StressIVChain)		if (StressIVChain)
return true;		return true;

if (!Chain.hasIncs())		if (!Chain.hasIncs())
return false;		return false;

if (!Users.empty()) {		if (!Users.empty()) {
LLVM_DEBUG(dbgs() << "Chain: " << *Chain.Incs[0].UserInst << " users:\n";		LLVM_DEBUG(dbgs() << "Chain: " << *Chain.Incs[0].UserInst << " users:\n";
... (243 lines not shown) ...	for (PHINode &PN : L->getHeader()->phis()) {
if (IncV)		if (IncV)
ChainInstruction(&PN, IncV, ChainUsersVec);		ChainInstruction(&PN, IncV, ChainUsersVec);
}		}
// Remove any unprofitable chains.		// Remove any unprofitable chains.
unsigned ChainIdx = 0;		unsigned ChainIdx = 0;
for (unsigned UsersIdx = 0, NChains = IVChainVec.size();		for (unsigned UsersIdx = 0, NChains = IVChainVec.size();
UsersIdx < NChains; ++UsersIdx) {		UsersIdx < NChains; ++UsersIdx) {
if (!isProfitableChain(IVChainVec[UsersIdx],		if (!isProfitableChain(IVChainVec[UsersIdx],
ChainUsersVec[UsersIdx].FarUsers, SE, TTI))		ChainUsersVec[UsersIdx].FarUsers, SE))
continue;		continue;
// Preserve the chain at UsersIdx.		// Preserve the chain at UsersIdx.
if (ChainIdx != UsersIdx)		if (ChainIdx != UsersIdx)
IVChainVec[ChainIdx] = IVChainVec[UsersIdx];		IVChainVec[ChainIdx] = IVChainVec[UsersIdx];
FinalizeChain(IVChainVec[ChainIdx]);		FinalizeChain(IVChainVec[ChainIdx]);
++ChainIdx;		++ChainIdx;
}		}
IVChainVec.resize(ChainIdx);		IVChainVec.resize(ChainIdx);
}		}

void LSRInstance::FinalizeChain(IVChain &Chain) {		void LSRInstance::FinalizeChain(IVChain &Chain) {
assert(!Chain.Incs.empty() && "empty IV chains are not allowed");		assert(!Chain.Incs.empty() && "empty IV chains are not allowed");
LLVM_DEBUG(dbgs() << "Final Chain: " << *Chain.Incs[0].UserInst << "\n");		LLVM_DEBUG(dbgs() << "Final Chain: " << *Chain.Incs[0].UserInst << "\n");

for (const IVInc &Inc : Chain) {		for (const IVInc &Inc : Chain) {
LLVM_DEBUG(dbgs() << " Inc: " << *Inc.UserInst << "\n");		LLVM_DEBUG(dbgs() << " Inc: " << *Inc.UserInst << "\n");
auto UseI = find(Inc.UserInst->operands(), Inc.IVOperand);		auto UseI = find(Inc.UserInst->operands(), Inc.IVOperand);
assert(UseI != Inc.UserInst->op_end() && "cannot find IV operand");		assert(UseI != Inc.UserInst->op_end() && "cannot find IV operand");
IVIncSet.insert(UseI);		IVIncSet.insert(UseI);
}		}
}		}

... (643 lines not shown) ...	if (Base.Scale == 1)
GenerateSymbolicOffsetsImpl(LU, LUIdx, Base, /* Idx */ -1,		GenerateSymbolicOffsetsImpl(LU, LUIdx, Base, /* Idx */ -1,
/* IsScaledReg */ true);		/* IsScaledReg */ true);
}		}

/// Helper function for LSRInstance::GenerateConstantOffsets.		/// Helper function for LSRInstance::GenerateConstantOffsets.
void LSRInstance::GenerateConstantOffsetsImpl(		void LSRInstance::GenerateConstantOffsetsImpl(
LSRUse &LU, unsigned LUIdx, const Formula &Base,		LSRUse &LU, unsigned LUIdx, const Formula &Base,
const SmallVectorImpl<int64_t> &Worklist, size_t Idx, bool IsScaledReg) {		const SmallVectorImpl<int64_t> &Worklist, size_t Idx, bool IsScaledReg) {
const SCEV *G = IsScaledReg ? Base.ScaledReg : Base.BaseRegs[Idx];
for (int64_t Offset : Worklist) {		auto GenerateOffset = [&](const SCEV *G, int64_t Offset) {
Formula F = Base;		Formula F = Base;
F.BaseOffset = (uint64_t)Base.BaseOffset - Offset;		F.BaseOffset = (uint64_t)Base.BaseOffset - Offset;

if (isLegalUse(TTI, LU.MinOffset - Offset, LU.MaxOffset - Offset, LU.Kind,		if (isLegalUse(TTI, LU.MinOffset - Offset, LU.MaxOffset - Offset, LU.Kind,
LU.AccessTy, F)) {		LU.AccessTy, F)) {
// Add the offset to the base register.		// Add the offset to the base register.
const SCEV *NewG = SE.getAddExpr(SE.getConstant(G->getType(), Offset), G);		const SCEV *NewG = SE.getAddExpr(SE.getConstant(G->getType(), Offset), G);
// If it cancelled out, drop the base register, otherwise update it.		// If it cancelled out, drop the base register, otherwise update it.
if (NewG->isZero()) {		if (NewG->isZero()) {
if (IsScaledReg) {		if (IsScaledReg) {
F.Scale = 0;		F.Scale = 0;
F.ScaledReg = nullptr;		F.ScaledReg = nullptr;
} else		} else
F.deleteBaseReg(F.BaseRegs[Idx]);		F.deleteBaseReg(F.BaseRegs[Idx]);
F.canonicalize(*L);		F.canonicalize(*L);
} else if (IsScaledReg)		} else if (IsScaledReg)
F.ScaledReg = NewG;		F.ScaledReg = NewG;
else		else
F.BaseRegs[Idx] = NewG;		F.BaseRegs[Idx] = NewG;

(void)InsertFormula(LU, LUIdx, F);		(void)InsertFormula(LU, LUIdx, F);
}		}
		};

		const SCEV *G = IsScaledReg ? Base.ScaledReg : Base.BaseRegs[Idx];

		// With constant offsets and constant steps, we can generate pre-inc
		// accesses by having the offset equal the step. So, for access #0 with a
		// step of 8, we generate a G - 8 base which would require the first access
		// to be ((G - 8) + 8),+,8. The pre-indexed access then updates the pointer
		// for itself and hopefully becomes the base for other accesses. This means
		// that a single pre-indexed access can be generated to become the new
		// base pointer for each iteration of the loop, resulting in no extra add/sub
		// instructions for pointer updating.
		if (FavorBackedgeIndex && LU.Kind == LSRUse::Address) {
		if (auto *GAR = dyn_cast<SCEVAddRecExpr>(G)) {
		if (auto *StepRec =
		dyn_cast<SCEVConstant>(GAR->getStepRecurrence(SE))) {
		const APInt &StepInt = StepRec->getAPInt();
		int64_t Step = StepInt.isNegative() ?
		StepInt.getSExtValue() : StepInt.getZExtValue();

		for (int64_t Offset : Worklist) {
		Offset -= Step;
		GenerateOffset(G, Offset);
		}
		}
		}
}		}
		for (int64_t Offset : Worklist)
		GenerateOffset(G, Offset);

int64_t Imm = ExtractImmediate(G, SE);		int64_t Imm = ExtractImmediate(G, SE);
if (G->isZero() || Imm == 0)		if (G->isZero() || Imm == 0)
return;		return;
Formula F = Base;		Formula F = Base;
F.BaseOffset = (uint64_t)F.BaseOffset + Imm;		F.BaseOffset = (uint64_t)F.BaseOffset + Imm;
if (!isLegalUse(TTI, LU.MinOffset, LU.MaxOffset, LU.Kind, LU.AccessTy, F))		if (!isLegalUse(TTI, LU.MinOffset, LU.MaxOffset, LU.Kind, LU.AccessTy, F))
return;		return;
... (640 lines not shown) ...	if (EstimateSearchSpaceComplexity() >= ComplexityLimit) {

LLVM_DEBUG(dbgs() << "After pre-selection:\n"; print_uses(dbgs()));		LLVM_DEBUG(dbgs() << "After pre-selection:\n"; print_uses(dbgs()));
}		}
}		}

/// When there are many registers for expressions like A, A+1, A+2, etc.,		/// When there are many registers for expressions like A, A+1, A+2, etc.,
/// allocate a single register for them.		/// allocate a single register for them.
void LSRInstance::NarrowSearchSpaceByCollapsingUnrolledCode() {		void LSRInstance::NarrowSearchSpaceByCollapsingUnrolledCode() {
if (EstimateSearchSpaceComplexity() < ComplexityLimit)		if (EstimateSearchSpaceComplexity() < ComplexityLimit)
return;		return;

LLVM_DEBUG(		LLVM_DEBUG(
dbgs() << "The search space is too complex.\n"		dbgs() << "The search space is too complex.\n"
"Narrowing the search space by assuming that uses separated "		"Narrowing the search space by assuming that uses separated "
"by a constant offset will use the same registers.\n");		"by a constant offset will use the same registers.\n");

// This is especially useful for unrolled loops.		// This is especially useful for unrolled loops.
... (944 lines not shown) ...	#endif
Rewriter.clear();		Rewriter.clear();

Changed |= DeleteTriviallyDeadInstructions(DeadInsts);		Changed |= DeleteTriviallyDeadInstructions(DeadInsts);
}		}

LSRInstance::LSRInstance(Loop *L, IVUsers &IU, ScalarEvolution &SE,		LSRInstance::LSRInstance(Loop *L, IVUsers &IU, ScalarEvolution &SE,
DominatorTree &DT, LoopInfo &LI,		DominatorTree &DT, LoopInfo &LI,
const TargetTransformInfo &TTI)		const TargetTransformInfo &TTI)
: IU(IU), SE(SE), DT(DT), LI(LI), TTI(TTI), L(L) {		: IU(IU), SE(SE), DT(DT), LI(LI), TTI(TTI), L(L),
		FavorBackedgeIndex(EnableBackedgeIndexing &&
		TTI.shouldFavorBackedgeIndex(L)) {
// If LoopSimplify form is not available, stay out of trouble.		// If LoopSimplify form is not available, stay out of trouble.
if (!L->isLoopSimplifyForm())		if (!L->isLoopSimplifyForm())
return;		return;

// If there's no interesting work to be done, bail early.		// If there's no interesting work to be done, bail early.
if (IU.empty()) return;		if (IU.empty()) return;

// If there's too much analysis to be done, bail early. We won't be able to		// If there's too much analysis to be done, bail early. We won't be able to
... (258 lines not shown) ...
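
The offset arithmetic in `GenerateConstantOffsetsImpl` can be illustrated outside LLVM. The sketch below is a toy model, not the patch's code: `ToyFormula`, `generateOffset`, and `backedgeCandidates` are hypothetical names, and a formula is reduced to a register start value plus an immediate. It mirrors only the two updates visible in the diff (`F.BaseOffset = Base.BaseOffset - Offset` and `NewG = Offset + G`) and the `Offset -= Step` adjustment applied when `FavorBackedgeIndex` is set.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for an LSR formula: the accessed address on the first
// iteration is RegStart + BaseOffset, and the register advances by the
// recurrence step on every iteration.
struct ToyFormula {
  int64_t RegStart;
  int64_t BaseOffset;
};

// Same arithmetic as the GenerateOffset lambda: fold Offset into the register
// and take it back out of the immediate, so the address is unchanged.
inline ToyFormula generateOffset(const ToyFormula &Base, int64_t Offset) {
  return ToyFormula{Base.RegStart + Offset, Base.BaseOffset - Offset};
}

// Candidates tried when backedge indexing is favoured: one per worklist
// entry, each shifted by the recurrence step.
inline std::vector<ToyFormula>
backedgeCandidates(const ToyFormula &Base, const std::vector<int64_t> &Worklist,
                   int64_t Step) {
  std::vector<ToyFormula> Out;
  for (int64_t Offset : Worklist)
    Out.push_back(generateOffset(Base, Offset - Step));
  return Out;
}
```

For a base of `{1000, 0}`, a worklist of `{0, 4}`, and a step of 8, the first candidate has its register starting at 992 with an immediate of 8. The immediate equals the step, which is the shape a pre-indexed access needs; the second candidate (register 996, immediate 4) addresses the same location with a plain offset.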

llvm/trunk/test/CodeGen/ARM/dsp-loop-indexing.ll

				; RUN: llc -mtriple=thumbv7em -mattr=+fp-armv8 %s -o - | FileCheck %s --check-prefix=CHECK --check-prefix=CHECK-DEFAULT
				; RUN: llc -mtriple=thumbv8m.main -mattr=+fp-armv8,+dsp %s -o - | FileCheck %s --check-prefix=CHECK --check-prefix=CHECK-DEFAULT
				; RUN: llc -mtriple=thumbv8m.main -mattr=+fp-armv8,+dsp -lsr-backedge-indexing=false %s -o - | FileCheck %s --check-prefix=CHECK --check-prefix=DISABLED
				; RUN: llc -mtriple=thumbv8 %s -o - | FileCheck %s --check-prefix=CHECK --check-prefix=DISABLED
				; RUN: llc -mtriple=thumbv8m.main -mattr=+fp-armv8,+dsp -lsr-complexity-limit=2147483647 %s -o - | FileCheck %s --check-prefix=CHECK --check-prefix=CHECK-COMPLEX

				; CHECK-LABEL: test_qadd_2
				; CHECK: @ %loop

				; CHECK-DEFAULT: ldr{{.*}}, #4]
				; CHECK-DEFAULT: ldr{{.*}}, #4]
				; CHECK-DEFAULT: str{{.*}}, #4]
				; CHECK-DEFAULT: ldr{{.*}}, #8]!
				; CHECK-DEFAULT: ldr{{.*}}, #8]!
				; CHECK-DEFAULT: str{{.*}}, #8]!

				; CHECK-COMPLEX: ldr{{.*}}, #8]!
				; CHECK-COMPLEX: ldr{{.*}}, #8]!
				; CHECK-COMPLEX: str{{.*}}, #8]!
				; CHECK-COMPLEX: ldr{{.*}}, #4]
				; CHECK-COMPLEX: ldr{{.*}}, #4]
				; CHECK-COMPLEX: str{{.*}}, #4]

				; DISABLED-NOT: ldr{{.*}}]!
				; DISABLED-NOT: str{{.*}}]!

				define void @test_qadd_2(i32* %a.array, i32* %b.array, i32* %out.array, i32 %N) {
				entry:
				br label %loop

				loop:
				%i = phi i32 [ 0, %entry ], [ %i.next, %loop ]
				%idx.1 = phi i32 [ 0, %entry ], [ %idx.next, %loop ]
				%gep.a.1 = getelementptr inbounds i32, i32* %a.array, i32 %idx.1
				%a.1 = load i32, i32* %gep.a.1
				%gep.b.1 = getelementptr inbounds i32, i32* %b.array, i32 %idx.1
				%b.1 = load i32, i32* %gep.b.1
				%qadd.1 = call i32 @llvm.arm.qadd(i32 %a.1, i32 %b.1)
				%addr.1 = getelementptr inbounds i32, i32* %out.array, i32 %idx.1
				store i32 %qadd.1, i32* %addr.1
				%idx.2 = or i32 %idx.1, 1
				%gep.a.2 = getelementptr inbounds i32, i32* %a.array, i32 %idx.2
				%a.2 = load i32, i32* %gep.a.2
				%gep.b.2 = getelementptr inbounds i32, i32* %b.array, i32 %idx.2
				%b.2 = load i32, i32* %gep.b.2
				%qadd.2 = call i32 @llvm.arm.qadd(i32 %a.2, i32 %b.2)
				%addr.2 = getelementptr inbounds i32, i32* %out.array, i32 %idx.2
				store i32 %qadd.2, i32* %addr.2
				%i.next = add nsw nuw i32 %i, -2
				%idx.next = add nsw nuw i32 %idx.1, 2
				%cmp = icmp ult i32 %i.next, %N
				br i1 %cmp, label %loop, label %exit

				exit:
				ret void
				}

				; CHECK-LABEL: test_qadd_2_backwards
				; TODO: Indexes should be generated.

				; CHECK: @ %loop

				; CHECK-DEFAULT: ldr{{.*}},
				; CHECK-DEFAULT: ldr{{.*}},
				; CHECK-DEFAULT: str{{.*}},
				; CHECK-DEFAULT: ldr{{.*}}, #-4]
				; CHECK-DEFAULT: ldr{{.*}}, #-4]
				; CHECK-DEFAULT: sub{{.*}}, #8
				; CHECK-DEFAULT: str{{.*}}, #-4]
				; CHECK-DEFAULT: sub{{.*}}, #8

				; CHECK-COMPLEX: ldr{{.*}} lsl #2]
				; CHECK-COMPLEX: ldr{{.*}} lsl #2]
				; CHECK-COMPLEX: str{{.*}} lsl #2]
				; CHECK-COMPLEX: ldr{{.*}} lsl #2]
				; CHECK-COMPLEX: ldr{{.*}} lsl #2]
				; CHECK-COMPLEX: str{{.*}} lsl #2]

				; DISABLED-NOT: ldr{{.*}}]!
				; DISABLED-NOT: str{{.*}}]!

				define void @test_qadd_2_backwards(i32* %a.array, i32* %b.array, i32* %out.array, i32 %N) {
				entry:
				br label %loop

				loop:
				%i = phi i32 [ 0, %entry ], [ %i.next, %loop ]
				%idx.1 = phi i32 [ %N, %entry ], [ %idx.next, %loop ]
				%gep.a.1 = getelementptr inbounds i32, i32* %a.array, i32 %idx.1
				%a.1 = load i32, i32* %gep.a.1
				%gep.b.1 = getelementptr inbounds i32, i32* %b.array, i32 %idx.1
				%b.1 = load i32, i32* %gep.b.1
				%qadd.1 = call i32 @llvm.arm.qadd(i32 %a.1, i32 %b.1)
				%addr.1 = getelementptr inbounds i32, i32* %out.array, i32 %idx.1
				store i32 %qadd.1, i32* %addr.1
				%idx.2 = sub nsw nuw i32 %idx.1, 1
				%gep.a.2 = getelementptr inbounds i32, i32* %a.array, i32 %idx.2
				%a.2 = load i32, i32* %gep.a.2
				%gep.b.2 = getelementptr inbounds i32, i32* %b.array, i32 %idx.2
				%b.2 = load i32, i32* %gep.b.2
				%qadd.2 = call i32 @llvm.arm.qadd(i32 %a.2, i32 %b.2)
				%addr.2 = getelementptr inbounds i32, i32* %out.array, i32 %idx.2
				store i32 %qadd.2, i32* %addr.2
				%i.next = add nsw nuw i32 %i, -2
				%idx.next = sub nsw nuw i32 %idx.1, 2
				%cmp = icmp ult i32 %i.next, %N
				br i1 %cmp, label %loop, label %exit

				exit:
				ret void
				}

				; CHECK-LABEL: test_qadd_3
				; CHECK: @ %loop

				; CHECK-DEFAULT: ldr{{.*}}, #8]
				; CHECK-DEFAULT: ldr{{.*}}, #8]
				; CHECK-DEFAULT: str{{.*}}, #8]
				; CHECK-DEFAULT: ldr{{.*}}, #12]!
				; CHECK-DEFAULT: ldr{{.*}}, #12]!
				; CHECK-DEFAULT: str{{.*}}, #12]!

				; CHECK-COMPLEX: ldr{{.*}}, #12]!
				; CHECK-COMPLEX: ldr{{.*}}, #12]!
				; CHECK-COMPLEX: str{{.*}}, #12]!
				; CHECK-COMPLEX: ldr{{.*}}, #4]
				; CHECK-COMPLEX: ldr{{.*}}, #4]
				; CHECK-COMPLEX: str{{.*}}, #4]
				; CHECK-COMPLEX: ldr{{.*}}, #8]
				; CHECK-COMPLEX: ldr{{.*}}, #8]
				; CHECK-COMPLEX: str{{.*}}, #8]

				; DISABLED-NOT: ldr{{.*}}]!
				; DISABLED-NOT: str{{.*}}]!

				define void @test_qadd_3(i32* %a.array, i32* %b.array, i32* %out.array, i32 %N) {
				entry:
				br label %loop

				loop:
				%i = phi i32 [ 0, %entry ], [ %i.next, %loop ]
				%idx.1 = phi i32 [ 0, %entry ], [ %idx.next, %loop ]
				%gep.a.1 = getelementptr inbounds i32, i32* %a.array, i32 %idx.1
				%a.1 = load i32, i32* %gep.a.1
				%gep.b.1 = getelementptr inbounds i32, i32* %b.array, i32 %idx.1
				%b.1 = load i32, i32* %gep.b.1
				%qadd.1 = call i32 @llvm.arm.qadd(i32 %a.1, i32 %b.1)
				%addr.1 = getelementptr inbounds i32, i32* %out.array, i32 %idx.1
				store i32 %qadd.1, i32* %addr.1
				%idx.2 = add nuw nsw i32 %idx.1, 1
				%gep.a.2 = getelementptr inbounds i32, i32* %a.array, i32 %idx.2
				%a.2 = load i32, i32* %gep.a.2
				%gep.b.2 = getelementptr inbounds i32, i32* %b.array, i32 %idx.2
				%b.2 = load i32, i32* %gep.b.2
				%qadd.2 = call i32 @llvm.arm.qadd(i32 %a.2, i32 %b.2)
				%addr.2 = getelementptr inbounds i32, i32* %out.array, i32 %idx.2
				store i32 %qadd.2, i32* %addr.2
				%idx.3 = add nuw nsw i32 %idx.1, 2
				%gep.a.3 = getelementptr inbounds i32, i32* %a.array, i32 %idx.3
				%a.3 = load i32, i32* %gep.a.3
				%gep.b.3 = getelementptr inbounds i32, i32* %b.array, i32 %idx.3
				%b.3 = load i32, i32* %gep.b.3
				%qadd.3 = call i32 @llvm.arm.qadd(i32 %a.3, i32 %b.3)
				%addr.3 = getelementptr inbounds i32, i32* %out.array, i32 %idx.3
				store i32 %qadd.3, i32* %addr.3
				%i.next = add nsw nuw i32 %i, -3
				%idx.next = add nsw nuw i32 %idx.1, 3
				%cmp = icmp ult i32 %i.next, %N
				br i1 %cmp, label %loop, label %exit

				exit:
				ret void
				}

				; CHECK-LABEL: test_qadd_4
				; CHECK: @ %loop

				; TODO: pre-inc store

				; CHECK-DEFAULT: ldr{{.*}}, #4]
				; CHECK-DEFAULT: ldr{{.*}}, #4]
				; CHECK-DEFAULT: str{{.*}}, #4]
				; CHECK-DEFAULT: ldr{{.*}}, #8]
				; CHECK-DEFAULT: ldr{{.*}}, #8]
				; CHECK-DEFAULT: str{{.*}}, #8]
				; CHECK-DEFAULT: ldr{{.*}}, #12]
				; CHECK-DEFAULT: ldr{{.*}}, #12]
				; CHECK-DEFAULT: str{{.*}}, #12]

				; CHECK-COMPLEX: ldr{{.*}}, #16]!
				; CHECK-COMPLEX: ldr{{.*}}, #16]!
				; CHECK-COMPLEX: str{{.*}}, #16]!
				; CHECK-COMPLEX: ldr{{.*}}, #4]
				; CHECK-COMPLEX: ldr{{.*}}, #4]
				; CHECK-COMPLEX: str{{.*}}, #4]
				; CHECK-COMPLEX: ldr{{.*}}, #8]
				; CHECK-COMPLEX: ldr{{.*}}, #8]
				; CHECK-COMPLEX: str{{.*}}, #8]
				; CHECK-COMPLEX: ldr{{.*}}, #12]
				; CHECK-COMPLEX: ldr{{.*}}, #12]
				; CHECK-COMPLEX: str{{.*}}, #12]

				; DISABLED-NOT: ldr{{.*}}]!
				; DISABLED-NOT: str{{.*}}]!

				define void @test_qadd_4(i32* %a.array, i32* %b.array, i32* %out.array, i32 %N) {
				entry:
				br label %loop

				loop:
				%i = phi i32 [ 0, %entry ], [ %i.next, %loop ]
				%idx.1 = phi i32 [ 0, %entry ], [ %idx.next, %loop ]
				%gep.a.1 = getelementptr inbounds i32, i32* %a.array, i32 %idx.1
				%a.1 = load i32, i32* %gep.a.1
				%gep.b.1 = getelementptr inbounds i32, i32* %b.array, i32 %idx.1
				%b.1 = load i32, i32* %gep.b.1
				%qadd.1 = call i32 @llvm.arm.qadd(i32 %a.1, i32 %b.1)
				%addr.1 = getelementptr inbounds i32, i32* %out.array, i32 %idx.1
				store i32 %qadd.1, i32* %addr.1
				%idx.2 = or i32 %idx.1, 1
				%gep.a.2 = getelementptr inbounds i32, i32* %a.array, i32 %idx.2
				%a.2 = load i32, i32* %gep.a.2
				%gep.b.2 = getelementptr inbounds i32, i32* %b.array, i32 %idx.2
				%b.2 = load i32, i32* %gep.b.2
				%qadd.2 = call i32 @llvm.arm.qadd(i32 %a.2, i32 %b.2)
				%addr.2 = getelementptr inbounds i32, i32* %out.array, i32 %idx.2
				store i32 %qadd.2, i32* %addr.2
				%idx.3 = or i32 %idx.1, 2
				%gep.a.3 = getelementptr inbounds i32, i32* %a.array, i32 %idx.3
				%a.3 = load i32, i32* %gep.a.3
				%gep.b.3 = getelementptr inbounds i32, i32* %b.array, i32 %idx.3
				%b.3 = load i32, i32* %gep.b.3
				%qadd.3 = call i32 @llvm.arm.qadd(i32 %a.3, i32 %b.3)
				%addr.3 = getelementptr inbounds i32, i32* %out.array, i32 %idx.3
				store i32 %qadd.3, i32* %addr.3
				%idx.4 = or i32 %idx.1, 3
				%gep.a.4 = getelementptr inbounds i32, i32* %a.array, i32 %idx.4
				%a.4 = load i32, i32* %gep.a.4
				%gep.b.4 = getelementptr inbounds i32, i32* %b.array, i32 %idx.4
				%b.4 = load i32, i32* %gep.b.4
				%qadd.4 = call i32 @llvm.arm.qadd(i32 %a.4, i32 %b.4)
				%addr.4 = getelementptr inbounds i32, i32* %out.array, i32 %idx.4
				store i32 %qadd.4, i32* %addr.4
				%i.next = add nsw nuw i32 %i, -4
				%idx.next = add nsw nuw i32 %idx.1, 4
				%cmp = icmp ult i32 %i.next, %N
				br i1 %cmp, label %loop, label %exit

				exit:
				ret void
				}

				; CHECK-LABEL: test_qadd16_2
				; CHECK: @ %loop
				; TODO: pre-inc store.

				; CHECK-DEFAULT: ldr{{.*}}, #4]
				; CHECK-DEFAULT: ldr{{.*}}, #4]
				; CHECK-DEFAULT: str{{.*}}, #8]
				; CHECK-DEFAULT: ldr{{.*}}, #8]!
				; CHECK-DEFAULT: ldr{{.*}}, #8]!
				; CHECK-DEFAULT: str{{.*}}, #16]!

				; CHECK-COMPLEX: ldr{{.*}}, #8]!
				; CHECK-COMPLEX: ldr{{.*}}, #8]!
				; CHECK-COMPLEX: str{{.*}}, #16]!
				; CHECK-COMPLEX: ldr{{.*}}, #4]
				; CHECK-COMPLEX: ldr{{.*}}, #4]
				; CHECK-COMPLEX: str{{.*}}, #8]

				; DISABLED-NOT: ldr{{.*}}]!
				; DISABLED-NOT: str{{.*}}]!

				define void @test_qadd16_2(i16* %a.array, i16* %b.array, i32* %out.array, i32 %N) {
				entry:
				br label %loop

				loop:
				%i = phi i32 [ 0, %entry ], [ %i.next, %loop ]
				%idx.1 = phi i32 [ 0, %entry ], [ %idx.next, %loop ]
				%gep.a.1 = getelementptr inbounds i16, i16* %a.array, i32 %idx.1
				%cast.a.1 = bitcast i16* %gep.a.1 to i32*
				%a.1 = load i32, i32* %cast.a.1
				%gep.b.1 = getelementptr inbounds i16, i16* %b.array, i32 %idx.1
				%cast.b.1 = bitcast i16* %gep.b.1 to i32*
				%b.1 = load i32, i32* %cast.b.1
				%qadd.1 = call i32 @llvm.arm.qadd16(i32 %a.1, i32 %b.1)
				%addr.1 = getelementptr inbounds i32, i32* %out.array, i32 %idx.1
				store i32 %qadd.1, i32* %addr.1
				%idx.2 = add nsw nuw i32 %idx.1, 2
				%gep.a.2 = getelementptr inbounds i16, i16* %a.array, i32 %idx.2
				%cast.a.2 = bitcast i16* %gep.a.2 to i32*
				%a.2 = load i32, i32* %cast.a.2
				%gep.b.2 = getelementptr inbounds i16, i16* %b.array, i32 %idx.2
				%cast.b.2 = bitcast i16* %gep.b.2 to i32*
				%b.2 = load i32, i32* %cast.b.2
				%qadd.2 = call i32 @llvm.arm.qadd16(i32 %a.2, i32 %b.2)
				%addr.2 = getelementptr inbounds i32, i32* %out.array, i32 %idx.2
				store i32 %qadd.2, i32* %addr.2
				%i.next = add nsw nuw i32 %i, -2
				%idx.next = add nsw nuw i32 %idx.1, 4
				%cmp = icmp ult i32 %i.next, %N
				br i1 %cmp, label %loop, label %exit

				exit:
				ret void
				}

				declare i32 @llvm.arm.qadd(i32, i32)
				declare i32 @llvm.arm.qadd16(i32, i32)
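
The `CHECK-DEFAULT` sequence for `test_qadd_2` pairs a plain immediate-offset access (`#4]`) with a pre-indexed one (`#8]!`). The sketch below is a toy model of those two addressing modes, assuming element indices instead of byte addresses (so `#4` becomes 1 and `#8` becomes 2). `offsetLoad`, `preIndexedLoad`, and `walkPairs` are hypothetical helpers, not anything in LLVM, and the starting base is simply one choice consistent with the checked immediates. It shows how a base that starts one element before the array lets the pre-indexed access double as the pointer update.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Models `ldr rX, [rN, #imm]`: read at Base + Imm, Base is left unchanged.
inline int offsetLoad(const std::vector<int> &Mem, int64_t Base, int64_t Imm) {
  return Mem[static_cast<std::size_t>(Base + Imm)];
}

// Models `ldr rX, [rN, #imm]!`: Base is advanced by Imm first, then read.
inline int preIndexedLoad(const std::vector<int> &Mem, int64_t &Base,
                          int64_t Imm) {
  Base += Imm;
  return Mem[static_cast<std::size_t>(Base)];
}

// Visits Mem two elements per iteration; assumes Mem.size() is even.
inline std::vector<int> walkPairs(const std::vector<int> &Mem) {
  std::vector<int> Seen;
  int64_t Base = -1; // one element before the start of the array
  for (std::size_t I = 0; I + 1 < Mem.size(); I += 2) {
    Seen.push_back(offsetLoad(Mem, Base, 1));     // like `[rN, #4]`
    Seen.push_back(preIndexedLoad(Mem, Base, 2)); // like `[rN, #8]!`
  }
  return Seen;
}
```

Every element is read once and in order, and no separate add is needed to advance the base between iterations.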

llvm/trunk/test/CodeGen/ARM/loop-align-cortex-m.ll

	; RUN: llc -mtriple=thumbv7m-none-eabi %s -mcpu=cortex-m3 -o - | FileCheck %s			; RUN: llc -mtriple=thumbv7m-none-eabi %s -mcpu=cortex-m3 -o - | FileCheck %s
	; RUN: llc -mtriple=thumbv7m-none-eabi %s -mcpu=cortex-m4 -o - | FileCheck %s			; RUN: llc -mtriple=thumbv7m-none-eabi %s -mcpu=cortex-m4 -o - | FileCheck %s
	; RUN: llc -mtriple=thumbv7m-none-eabi %s -mcpu=cortex-m33 -o - | FileCheck %s			; RUN: llc -mtriple=thumbv8m-none-eabi %s -mcpu=cortex-m33 -o - | FileCheck %s

	define void @test_loop_alignment(i32* %in, i32* %out) optsize {			define void @test_loop_alignment(i32* %in, i32* %out) optsize {
	; CHECK-LABEL: test_loop_alignment:			; CHECK-LABEL: test_loop_alignment:
	; CHECK: movs {{r[0-9]+}}, #0			; CHECK: mov{{.*}}, #0
	; CHECK: .p2align 2			; CHECK: .p2align 2

	entry:			entry:
	br label %loop			br label %loop

	loop:			loop:
	%i = phi i32 [ 0, %entry ], [ %i.next, %loop ]			%i = phi i32 [ 0, %entry ], [ %i.next, %loop ]
	%in.addr = getelementptr inbounds i32, i32* %in, i32 %i			%in.addr = getelementptr inbounds i32, i32* %in, i32 %i
	... (34 lines not shown) ...

llvm/trunk/test/CodeGen/ARM/loop-indexing.ll

				; RUN: llc -mtriple=thumbv7em -mattr=+fp-armv8 %s -o - | FileCheck %s --check-prefix=CHECK --check-prefix=CHECK-BASE --check-prefix=CHECK-DEFAULT --check-prefix=CHECK-T2
				; RUN: llc -mtriple=thumbv8m.main -mattr=+fp-armv8,+dsp %s -o - | FileCheck %s --check-prefix=CHECK --check-prefix=CHECK-DEFAULT --check-prefix=CHECK-T2
				; RUN: llc -mtriple=thumbv8m.main -mattr=+fp-armv8,+dsp -lsr-backedge-indexing=false %s -o - | FileCheck %s --check-prefix=CHECK --check-prefix=DISABLED
				; RUN: llc -mtriple=thumbv8m.base %s -o - | FileCheck %s --check-prefix=CHECK --check-prefix=DISABLED
				; RUN: llc -mtriple=thumbv8 %s -o - | FileCheck %s --check-prefix=CHECK --check-prefix=DISABLED
				; RUN: llc -mtriple=thumbv8m.main -mattr=+fp-armv8,+dsp -lsr-complexity-limit=2147483647 %s -o - | FileCheck %s --check-prefix=CHECK --check-prefix=CHECK-COMPLEX --check-prefix=CHECK-T2

				; Tests to check that post increment addressing modes are used instead of
				; updating base pointers with add instructions.

				; TODO: I think we should be able to use post inc addressing with VLDM
				; instructions.
				; CHECK-LABEL: test_fma
				; CHECK: @ %loop

				; CHECK-BASE: vldr s{{.*}}, #8]
				; CHECK-BASE: vldr s{{.*}}, #8]
				; CHECK-BASE: vldr s{{.*}}, #12]
				; CHECK-BASE: vldr s{{.*}}, #12]

				; CHECK-COMPLEX: vldr s{{.*}}, #8]
				; CHECK-COMPLEX: vldr s{{.*}}, #8]
				; CHECK-COMPLEX: vldr s{{.*}}, #12]
				; CHECK-COMPLEX: vldr s{{.*}}, #12]

				define float @test_fma(float* %a, float* %b, i32 %N) {
				entry:
				br label %loop

				loop:
				%i = phi i32 [ 0, %entry ], [ %i.next, %loop ]
				%idx.1 = phi i32 [ 0, %entry ], [ %idx.next, %loop ]
				%res = phi float [ 0.0, %entry ], [ %fma.2, %loop ]
				%gep.a.1 = getelementptr inbounds float, float* %a, i32 %idx.1
				%a.1 = load float, float* %gep.a.1
				%gep.b.1 = getelementptr inbounds float, float* %b, i32 %idx.1
				%b.1 = load float, float* %gep.b.1
				%fmul.1 = fmul float %a.1, %b.1
				%fma.1 = fadd float %fmul.1, %res
				%idx.2 = or i32 %idx.1, 1
				%gep.a.2 = getelementptr inbounds float, float* %a, i32 %idx.2
				%a.2 = load float, float* %gep.a.2
				%gep.b.2 = getelementptr inbounds float, float* %b, i32 %idx.2
				%b.2 = load float, float* %gep.b.2
				%fmul.2 = fmul float %a.2, %b.2
				%fma.2 = fadd float %fmul.2, %fma.1
				%i.next = add nsw nuw i32 %i, -2
				%idx.next = add nsw nuw i32 %idx.1, 2
				%cmp = icmp ult i32 %i.next, %N
				br i1 %cmp, label %loop, label %exit

				exit:
				ret float %fma.2
				}

				; CHECK-LABEL: convolve_16bit
				; TODO: Both arrays should use indexing
				; CHECK-DEFAULT: ldr{{.*}}, #8]!
				; CHECK-DEFAULT: ldr{{.*}}, #10]
				; CHECK-DEFAULT: ldr{{.*}}, #4]
				; CHECK-DEFAULT: ldr{{.*}}, #6]

				; CHECK-COMPLEX: ldr{{.*}}, #8]!
				; CHECK-COMPLEX: ldr{{.*}}, #10]
				; CHECK-COMPLEX: ldr{{.*}}, #4]
				; CHECK-COMPLEX: ldr{{.*}}, #6]

				; DISABLED-NOT: ldr{{.*}}]!
				; DISABLED-NOT: str{{.*}}]!

				define void @convolve_16bit(i16** nocapture readonly %input_image, i16** nocapture readonly %filter,
				i32 %filter_dim, i32 %out_width, i32 %out_height,
				i32** nocapture readonly %convolved) {
				entry:
				%cmp92 = icmp eq i32 %out_height, 0
				br i1 %cmp92, label %for.cond.cleanup, label %for.cond1.preheader.lr.ph

				for.cond1.preheader.lr.ph: ; preds = %entry
				%xtraiter = and i32 %filter_dim, 3
				%unroll_iter = sub i32 %filter_dim, %xtraiter
				br label %for.cond1.preheader

				for.cond1.preheader: ; preds = %for.cond.cleanup3, %for.cond1.preheader.lr.ph
				%res_y.093 = phi i32 [ 0, %for.cond1.preheader.lr.ph ], [ %add28, %for.cond.cleanup3 ]
				%arrayidx22 = getelementptr inbounds i32*, i32** %convolved, i32 %res_y.093
				%tmp3 = load i32*, i32** %arrayidx22, align 4
				br label %for.cond9.preheader.us.us.preheader

				for.cond9.preheader.us.us.preheader: ; preds = %for.cond5.for.cond.cleanup7_crit_edge.us, %for.cond5.preheader.lr.ph
				%res_x.060.us = phi i32 [ %add25.us, %for.cond5.for.cond.cleanup7_crit_edge.us ], [ 0, %for.cond1.preheader ]
				br label %for.cond9.preheader.us.us

				for.cond9.preheader.us.us: ; preds = %for.cond9.for.cond.cleanup11_crit_edge.us.us, %for.cond9.preheader.us.us.preheader
				%filter_y.056.us.us = phi i32 [ %inc20.us.us, %for.cond9.for.cond.cleanup11_crit_edge.us.us.unr-lcssa ], [ 0, %for.cond9.preheader.us.us.preheader ]
				%result_element.055.us.us = phi i32 [ %add18.us.us.3, %for.cond9.for.cond.cleanup11_crit_edge.us.us.unr-lcssa ], [ 0, %for.cond9.preheader.us.us.preheader ]
				%add.us.us = add i32 %filter_y.056.us.us, %res_y.093
				%arrayidx.us.us = getelementptr inbounds i16*, i16** %filter, i32 %filter_y.056.us.us
				%tmp5 = load i16*, i16** %arrayidx.us.us, align 4
				%arrayidx15.us.us = getelementptr inbounds i16*, i16** %input_image, i32 %add.us.us
				%tmp6 = load i16*, i16** %arrayidx15.us.us, align 4
				br label %for.body12.us.us

				for.body12.us.us: ; preds = %for.body12.us.us, %for.cond9.preheader.us.us
				%filter_x.053.us.us = phi i32 [ %inc.us.us.3, %for.body12.us.us ], [ 0, %for.cond9.preheader.us.us ]
				%result_element.152.us.us = phi i32 [ %add18.us.us.3, %for.body12.us.us ], [ %result_element.055.us.us, %for.cond9.preheader.us.us ]
				%niter = phi i32 [ %niter.nsub.3, %for.body12.us.us ], [ %unroll_iter, %for.cond9.preheader.us.us ]
				%add13.us.us = add i32 %filter_x.053.us.us, %res_x.060.us
				%arrayidx14.us.us = getelementptr inbounds i16, i16* %tmp5, i32 %filter_x.053.us.us
				%tmp9 = load i16, i16* %arrayidx14.us.us, align 2
				%conv.us.us = sext i16 %tmp9 to i32
				%arrayidx16.us.us = getelementptr inbounds i16, i16* %tmp6, i32 %add13.us.us
				%tmp10 = load i16, i16* %arrayidx16.us.us, align 2
				%conv17.us.us = sext i16 %tmp10 to i32
				%mul.us.us = mul nsw i32 %conv17.us.us, %conv.us.us
				%add18.us.us = add nsw i32 %mul.us.us, %result_element.152.us.us
				%inc.us.us = or i32 %filter_x.053.us.us, 1
				%add13.us.us.1 = add i32 %inc.us.us, %res_x.060.us
				%arrayidx14.us.us.1 = getelementptr inbounds i16, i16* %tmp5, i32 %inc.us.us
				%tmp11 = load i16, i16* %arrayidx14.us.us.1, align 2
				%conv.us.us.1 = sext i16 %tmp11 to i32
				%arrayidx16.us.us.1 = getelementptr inbounds i16, i16* %tmp6, i32 %add13.us.us.1
				%tmp12 = load i16, i16* %arrayidx16.us.us.1, align 2
				%conv17.us.us.1 = sext i16 %tmp12 to i32
				%mul.us.us.1 = mul nsw i32 %conv17.us.us.1, %conv.us.us.1
				%add18.us.us.1 = add nsw i32 %mul.us.us.1, %add18.us.us
				%inc.us.us.1 = or i32 %filter_x.053.us.us, 2
				%add13.us.us.2 = add i32 %inc.us.us.1, %res_x.060.us
				%arrayidx14.us.us.2 = getelementptr inbounds i16, i16* %tmp5, i32 %inc.us.us.1
				%tmp13 = load i16, i16* %arrayidx14.us.us.2, align 2
				%conv.us.us.2 = sext i16 %tmp13 to i32
				%arrayidx16.us.us.2 = getelementptr inbounds i16, i16* %tmp6, i32 %add13.us.us.2
				%tmp14 = load i16, i16* %arrayidx16.us.us.2, align 2
				%conv17.us.us.2 = sext i16 %tmp14 to i32
				%mul.us.us.2 = mul nsw i32 %conv17.us.us.2, %conv.us.us.2
				%add18.us.us.2 = add nsw i32 %mul.us.us.2, %add18.us.us.1
				%inc.us.us.2 = or i32 %filter_x.053.us.us, 3
				%add13.us.us.3 = add i32 %inc.us.us.2, %res_x.060.us
				%arrayidx14.us.us.3 = getelementptr inbounds i16, i16* %tmp5, i32 %inc.us.us.2
				%tmp15 = load i16, i16* %arrayidx14.us.us.3, align 2
				%conv.us.us.3 = sext i16 %tmp15 to i32
				%arrayidx16.us.us.3 = getelementptr inbounds i16, i16* %tmp6, i32 %add13.us.us.3
				%tmp16 = load i16, i16* %arrayidx16.us.us.3, align 2
				%conv17.us.us.3 = sext i16 %tmp16 to i32
				%mul.us.us.3 = mul nsw i32 %conv17.us.us.3, %conv.us.us.3
				%add18.us.us.3 = add nsw i32 %mul.us.us.3, %add18.us.us.2
				%inc.us.us.3 = add i32 %filter_x.053.us.us, 4
				%niter.nsub.3 = add i32 %niter, -4
				%niter.ncmp.3 = icmp eq i32 %niter.nsub.3, 0
				br i1 %niter.ncmp.3, label %for.cond9.for.cond.cleanup11_crit_edge.us.us.unr-lcssa, label %for.body12.us.us

				for.cond9.for.cond.cleanup11_crit_edge.us.us.unr-lcssa: ; preds = %for.body12.us.us, %for.cond9.preheader.us.us
				%inc20.us.us = add nuw i32 %filter_y.056.us.us, 1
				%exitcond98 = icmp eq i32 %inc20.us.us, %filter_dim
				br i1 %exitcond98, label %for.cond5.for.cond.cleanup7_crit_edge.us, label %for.cond9.preheader.us.us

				for.cond5.for.cond.cleanup7_crit_edge.us: ; preds = %for.cond9.for.cond.cleanup11_crit_edge.us.us
				%arrayidx23.us = getelementptr inbounds i32, i32* %tmp3, i32 %res_x.060.us
				store i32 %add18.us.us.3, i32* %arrayidx23.us, align 4
				%add25.us = add nuw i32 %res_x.060.us, 1
				%exitcond99 = icmp eq i32 %add25.us, %out_width
				br i1 %exitcond99, label %for.cond.cleanup3, label %for.cond9.preheader.us.us.preheader

				for.cond.cleanup3: ; preds = %for.cond5.for.cond.cleanup7_crit_edge.us, %for.cond5.preheader.preheader, %for.cond1.preheader
				%add28 = add nuw i32 %res_y.093, 1
				%exitcond100 = icmp eq i32 %add28, %out_height
				br i1 %exitcond100, label %for.cond.cleanup, label %for.cond1.preheader

				for.cond.cleanup: ; preds = %for.cond.cleanup3, %entry
				ret void
				}

				; CHECK-LABEL: mul_8x8
				; CHECK: @ %for.body

				; CHECK-DEFAULT: ldrb{{.*}}, #3]
				; CHECK-DEFAULT: ldrb{{.*}}, #3]
				; CHECK-DEFAULT: str{{.*}}, #16]!
				; CHECK-DEFAULT: ldrb{{.*}}, #4]!
				; CHECK-DEFAULT: ldrb{{.*}}, #4]!
				; CHECK-DEFAULT: str{{.*}}, #4]
				; CHECK-DEFAULT: ldrb{{.*}}, #1]
				; CHECK-DEFAULT: ldrb{{.*}}, #1]
				; CHECK-DEFAULT: str{{.*}}, #8]
				; CHECK-DEFAULT: ldrb{{.*}}, #2]
				; CHECK-DEFAULT: ldrb{{.*}}, #2]
				; CHECK-DEFAULT: str{{.*}}, #12]

				; CHECK-COMPLEX: ldrb{{.*}}, #3]
				; CHECK-COMPLEX: ldrb{{.*}}, #3]
				; CHECK-COMPLEX: str{{.*}}, #16]!
				; CHECK-COMPLEX: ldrb{{.*}}, #4]!
				; CHECK-COMPLEX: ldrb{{.*}}, #4]!
				; CHECK-COMPLEX: str{{.*}}, #4]
				; CHECK-COMPLEX: ldrb{{.*}}, #1]
				; CHECK-COMPLEX: ldrb{{.*}}, #1]
				; CHECK-COMPLEX: str{{.*}}, #8]
				; CHECK-COMPLEX: ldrb{{.*}}, #2]
				; CHECK-COMPLEX: ldrb{{.*}}, #2]
				; CHECK-COMPLEX: str{{.*}}, #12]

				; DISABLED-NOT: ldr{{.*}}]!
				; DISABLED-NOT: str{{.*}}]!

				; CHECK-T2: @ %for.body.epil
				; CHECK-T2: ldrb{{.*}}, #1]!
				; CHECK-T2: ldrb{{.*}}, #1]!
				; CHECK-T2: str{{.*}}, #4]!

				define void @mul_8x8(i8* nocapture readonly %A, i8* nocapture readonly %B, i32* nocapture %C, i32 %N) {
				entry:
				%cmp9 = icmp eq i32 %N, 0
				br i1 %cmp9, label %for.cond.cleanup, label %for.body.preheader

				for.body.preheader: ; preds = %entry
				%tmp = add i32 %N, -1
				%xtraiter = and i32 %N, 3
				%tmp1 = icmp ult i32 %tmp, 3
				br i1 %tmp1, label %for.cond.cleanup.loopexit.unr-lcssa, label %for.body.preheader.new

				for.body.preheader.new: ; preds = %for.body.preheader
				%unroll_iter = sub i32 %N, %xtraiter
				br label %for.body

				for.cond.cleanup.loopexit.unr-lcssa: ; preds = %for.body, %for.body.preheader
				%i.010.unr = phi i32 [ 0, %for.body.preheader ], [ %inc.3, %for.body ]
				%lcmp.mod = icmp eq i32 %xtraiter, 0
				br i1 %lcmp.mod, label %for.cond.cleanup, label %for.body.epil

				for.body.epil: ; preds = %for.body.epil, %for.cond.cleanup.loopexit.unr-lcssa
				%i.010.epil = phi i32 [ %inc.epil, %for.body.epil ], [ %i.010.unr, %for.cond.cleanup.loopexit.unr-lcssa ]
				%epil.iter = phi i32 [ %epil.iter.sub, %for.body.epil ], [ %xtraiter, %for.cond.cleanup.loopexit.unr-lcssa ]
				%arrayidx.epil = getelementptr inbounds i8, i8* %A, i32 %i.010.epil
				%tmp2 = load i8, i8* %arrayidx.epil, align 1
				%conv.epil = zext i8 %tmp2 to i32
				%arrayidx1.epil = getelementptr inbounds i8, i8* %B, i32 %i.010.epil
				%tmp3 = load i8, i8* %arrayidx1.epil, align 1
				%conv2.epil = zext i8 %tmp3 to i32
				%mul.epil = mul nuw nsw i32 %conv2.epil, %conv.epil
				%arrayidx3.epil = getelementptr inbounds i32, i32* %C, i32 %i.010.epil
				store i32 %mul.epil, i32* %arrayidx3.epil, align 4
				%inc.epil = add nuw i32 %i.010.epil, 1
				%epil.iter.sub = add i32 %epil.iter, -1
				%epil.iter.cmp = icmp eq i32 %epil.iter.sub, 0
				br i1 %epil.iter.cmp, label %for.cond.cleanup, label %for.body.epil

				for.cond.cleanup: ; preds = %for.body.epil, %for.cond.cleanup.loopexit.unr-lcssa, %entry
				ret void

				for.body: ; preds = %for.body, %for.body.preheader.new
				%i.010 = phi i32 [ 0, %for.body.preheader.new ], [ %inc.3, %for.body ]
				%niter = phi i32 [ %unroll_iter, %for.body.preheader.new ], [ %niter.nsub.3, %for.body ]
				%arrayidx = getelementptr inbounds i8, i8* %A, i32 %i.010
				%tmp4 = load i8, i8* %arrayidx, align 1
				%conv = zext i8 %tmp4 to i32
				%arrayidx1 = getelementptr inbounds i8, i8* %B, i32 %i.010
				%tmp5 = load i8, i8* %arrayidx1, align 1
				%conv2 = zext i8 %tmp5 to i32
				%mul = mul nuw nsw i32 %conv2, %conv
				%arrayidx3 = getelementptr inbounds i32, i32* %C, i32 %i.010
				store i32 %mul, i32* %arrayidx3, align 4
				%inc = or i32 %i.010, 1
				%arrayidx.1 = getelementptr inbounds i8, i8* %A, i32 %inc
				%tmp6 = load i8, i8* %arrayidx.1, align 1
				%conv.1 = zext i8 %tmp6 to i32
				%arrayidx1.1 = getelementptr inbounds i8, i8* %B, i32 %inc
				%tmp7 = load i8, i8* %arrayidx1.1, align 1
				%conv2.1 = zext i8 %tmp7 to i32
				%mul.1 = mul nuw nsw i32 %conv2.1, %conv.1
				%arrayidx3.1 = getelementptr inbounds i32, i32* %C, i32 %inc
				store i32 %mul.1, i32* %arrayidx3.1, align 4
				%inc.1 = or i32 %i.010, 2
				%arrayidx.2 = getelementptr inbounds i8, i8* %A, i32 %inc.1
				%tmp8 = load i8, i8* %arrayidx.2, align 1
				%conv.2 = zext i8 %tmp8 to i32
				%arrayidx1.2 = getelementptr inbounds i8, i8* %B, i32 %inc.1
				%tmp9 = load i8, i8* %arrayidx1.2, align 1
				%conv2.2 = zext i8 %tmp9 to i32
				%mul.2 = mul nuw nsw i32 %conv2.2, %conv.2
				%arrayidx3.2 = getelementptr inbounds i32, i32* %C, i32 %inc.1
				store i32 %mul.2, i32* %arrayidx3.2, align 4
				%inc.2 = or i32 %i.010, 3
				%arrayidx.3 = getelementptr inbounds i8, i8* %A, i32 %inc.2
				%tmp10 = load i8, i8* %arrayidx.3, align 1
				%conv.3 = zext i8 %tmp10 to i32
				%arrayidx1.3 = getelementptr inbounds i8, i8* %B, i32 %inc.2
				%tmp11 = load i8, i8* %arrayidx1.3, align 1
				%conv2.3 = zext i8 %tmp11 to i32
				%mul.3 = mul nuw nsw i32 %conv2.3, %conv.3
				%arrayidx3.3 = getelementptr inbounds i32, i32* %C, i32 %inc.2
				store i32 %mul.3, i32* %arrayidx3.3, align 4
				%inc.3 = add i32 %i.010, 4
				%niter.nsub.3 = add i32 %niter, -4
				%niter.ncmp.3 = icmp eq i32 %niter.nsub.3, 0
				br i1 %niter.ncmp.3, label %for.cond.cleanup.loopexit.unr-lcssa, label %for.body
				}

				; CHECK-LABEL: mul_16x8
				; CHECK: @ %for.body

				; CHECK-DEFAULT: ldrsh{{.*}}, #2]
				; CHECK-DEFAULT: ldrb{{.*}}, #-1]
				; CHECK-DEFAULT: str{{.*}}, #16]!
				; CHECK-DEFAULT: ldrb{{.*}},
				; CHECK-DEFAULT: ldrsh{{.*}}, #2]
				; CHECK-DEFAULT: str{{.*}}, #4]
				; CHECK-DEFAULT: ldrsh{{.*}}, #4]
				; CHECK-DEFAULT: ldrb{{.*}}, #1]
				; CHECK-DEFAULT: str{{.*}}, #8]
				; CHECK-DEFAULT: ldrsh{{.*}}, #8]!
				; CHECK-DEFAULT: ldrb{{.*}}, #2]
				; CHECK-DEFAULT: str{{.*}}, #12]

				; CHECK-COMPLEX: ldrsh{{.*}}, #8]!
				; CHECK-COMPLEX: str{{.*}}, #16]!
				; CHECK-COMPLEX: ldrb{{.*}}, #4]!

				; DISABLED-NOT: ldr{{.*}}]!
				; DISABLED-NOT: str{{.*}}]!

				; CHECK-T2: @ %for.body.epil
				; CHECK-T2: ldrsh{{.*}}, #2]!
				; CHECK-T2: ldrb{{.*}}, #1]!
				; CHECK-T2: str{{.*}}, #4]!

				define void @mul_16x8(i16* nocapture readonly %A, i8* nocapture readonly %B, i32* nocapture %C, i32 %N) {
				entry:
				%cmp9 = icmp eq i32 %N, 0
				br i1 %cmp9, label %for.cond.cleanup, label %for.body.preheader

				for.body.preheader: ; preds = %entry
				%tmp = add i32 %N, -1
				%xtraiter = and i32 %N, 3
				%tmp1 = icmp ult i32 %tmp, 3
				br i1 %tmp1, label %for.cond.cleanup.loopexit.unr-lcssa, label %for.body.preheader.new

				for.body.preheader.new: ; preds = %for.body.preheader
				%unroll_iter = sub i32 %N, %xtraiter
				br label %for.body

				for.cond.cleanup.loopexit.unr-lcssa: ; preds = %for.body, %for.body.preheader
				%i.010.unr = phi i32 [ 0, %for.body.preheader ], [ %inc.3, %for.body ]
				%lcmp.mod = icmp eq i32 %xtraiter, 0
				br i1 %lcmp.mod, label %for.cond.cleanup, label %for.body.epil

				for.body.epil: ; preds = %for.body.epil, %for.cond.cleanup.loopexit.unr-lcssa
				%i.010.epil = phi i32 [ %inc.epil, %for.body.epil ], [ %i.010.unr, %for.cond.cleanup.loopexit.unr-lcssa ]
				%epil.iter = phi i32 [ %epil.iter.sub, %for.body.epil ], [ %xtraiter, %for.cond.cleanup.loopexit.unr-lcssa ]
				%arrayidx.epil = getelementptr inbounds i16, i16* %A, i32 %i.010.epil
				%tmp2 = load i16, i16* %arrayidx.epil, align 2
				%conv.epil = sext i16 %tmp2 to i32
				%arrayidx1.epil = getelementptr inbounds i8, i8* %B, i32 %i.010.epil
				%tmp3 = load i8, i8* %arrayidx1.epil, align 1
				%conv2.epil = zext i8 %tmp3 to i32
				%mul.epil = mul nsw i32 %conv2.epil, %conv.epil
				%arrayidx3.epil = getelementptr inbounds i32, i32* %C, i32 %i.010.epil
				store i32 %mul.epil, i32* %arrayidx3.epil, align 4
				%inc.epil = add nuw i32 %i.010.epil, 1
				%epil.iter.sub = add i32 %epil.iter, -1
				%epil.iter.cmp = icmp eq i32 %epil.iter.sub, 0
				br i1 %epil.iter.cmp, label %for.cond.cleanup, label %for.body.epil

				for.cond.cleanup: ; preds = %for.body.epil, %for.cond.cleanup.loopexit.unr-lcssa, %entry
				ret void

				for.body: ; preds = %for.body, %for.body.preheader.new
				%i.010 = phi i32 [ 0, %for.body.preheader.new ], [ %inc.3, %for.body ]
				%niter = phi i32 [ %unroll_iter, %for.body.preheader.new ], [ %niter.nsub.3, %for.body ]
				%arrayidx = getelementptr inbounds i16, i16* %A, i32 %i.010
				%tmp4 = load i16, i16* %arrayidx, align 2
				%conv = sext i16 %tmp4 to i32
				%arrayidx1 = getelementptr inbounds i8, i8* %B, i32 %i.010
				%tmp5 = load i8, i8* %arrayidx1, align 1
				%conv2 = zext i8 %tmp5 to i32
				%mul = mul nsw i32 %conv2, %conv
				%arrayidx3 = getelementptr inbounds i32, i32* %C, i32 %i.010
				store i32 %mul, i32* %arrayidx3, align 4
				%inc = or i32 %i.010, 1
				%arrayidx.1 = getelementptr inbounds i16, i16* %A, i32 %inc
				%tmp6 = load i16, i16* %arrayidx.1, align 2
				%conv.1 = sext i16 %tmp6 to i32
				%arrayidx1.1 = getelementptr inbounds i8, i8* %B, i32 %inc
				%tmp7 = load i8, i8* %arrayidx1.1, align 1
				%conv2.1 = zext i8 %tmp7 to i32
				%mul.1 = mul nsw i32 %conv2.1, %conv.1
				%arrayidx3.1 = getelementptr inbounds i32, i32* %C, i32 %inc
				store i32 %mul.1, i32* %arrayidx3.1, align 4
				%inc.1 = or i32 %i.010, 2
				%arrayidx.2 = getelementptr inbounds i16, i16* %A, i32 %inc.1
				%tmp8 = load i16, i16* %arrayidx.2, align 2
				%conv.2 = sext i16 %tmp8 to i32
				%arrayidx1.2 = getelementptr inbounds i8, i8* %B, i32 %inc.1
				%tmp9 = load i8, i8* %arrayidx1.2, align 1
				%conv2.2 = zext i8 %tmp9 to i32
				%mul.2 = mul nsw i32 %conv2.2, %conv.2
				%arrayidx3.2 = getelementptr inbounds i32, i32* %C, i32 %inc.1
				store i32 %mul.2, i32* %arrayidx3.2, align 4
				%inc.2 = or i32 %i.010, 3
				%arrayidx.3 = getelementptr inbounds i16, i16* %A, i32 %inc.2
				%tmp10 = load i16, i16* %arrayidx.3, align 2
				%conv.3 = sext i16 %tmp10 to i32
				%arrayidx1.3 = getelementptr inbounds i8, i8* %B, i32 %inc.2
				%tmp11 = load i8, i8* %arrayidx1.3, align 1
				%conv2.3 = zext i8 %tmp11 to i32
				%mul.3 = mul nsw i32 %conv2.3, %conv.3
				%arrayidx3.3 = getelementptr inbounds i32, i32* %C, i32 %inc.2
				store i32 %mul.3, i32* %arrayidx3.3, align 4
				%inc.3 = add i32 %i.010, 4
				%niter.nsub.3 = add i32 %niter, -4
				%niter.ncmp.3 = icmp eq i32 %niter.nsub.3, 0
				br i1 %niter.ncmp.3, label %for.cond.cleanup.loopexit.unr-lcssa, label %for.body
				}

				; CHECK-LABEL: mul_16x16
				; CHECK: @ %for.body

				; TODO: pre-inc store
				; CHECK-DEFAULT: ldrsh{{.*}}, #2]
				; CHECK-DEFAULT: ldrsh{{.*}}, #2]
				; CHECK-DEFAULT: str{{.*}}, #16]!
				; CHECK-DEFAULT: ldrsh{{.*}}, #2]
				; CHECK-DEFAULT: ldrsh{{.*}}, #2]
				; CHECK-DEFAULT: str{{.*}}, #4]
				; CHECK-DEFAULT: ldrsh{{.*}}, #4]
				; CHECK-DEFAULT: ldrsh{{.*}}, #4]
				; CHECK-DEFAULT: str{{.*}}, #8]
				; CHECK-DEFAULT: ldrsh{{.*}}, #8]
				; CHECK-DEFAULT: ldrsh{{.*}}, #8]
				; CHECK-DEFAULT: str{{.*}}, #12]

				; CHECK-COMPLEX: ldrsh
				; CHECK-COMPLEX: ldrsh
				; CHECK-COMPLEX: str
				; CHECK-COMPLEX: ldrsh{{.*}}, #2]
				; CHECK-COMPLEX: ldrsh{{.*}}, #2]
				; CHECK-COMPLEX: str{{.*}}, #4]
				; CHECK-COMPLEX: ldrsh{{.*}}, #4]
				; CHECK-COMPLEX: ldrsh{{.*}}, #4]
				; CHECK-COMPLEX: str{{.*}}, #8]
				; CHECK-COMPLEX: ldrsh{{.*}}, #6]
				; CHECK-COMPLEX: ldrsh{{.*}}, #6]
				; CHECK-COMPLEX: str{{.*}}, #12]

				; DISABLED-NOT: ldr{{.*}}]!
				; DISABLED-NOT: str{{.*}}]!

				; CHECK-T2: @ %for.body.epil
				; CHECK-T2: ldrsh{{.*}}, #2]!
				; CHECK-T2: ldrsh{{.*}}, #2]!
				; CHECK-T2: str{{.*}}, #4]!

				define void @mul_16x16(i16* nocapture readonly %A, i16* nocapture readonly %B, i32* nocapture %C, i32 %N) {
				entry:
				%cmp9 = icmp eq i32 %N, 0
				br i1 %cmp9, label %for.cond.cleanup, label %for.body.preheader

				for.body.preheader: ; preds = %entry
				%tmp = add i32 %N, -1
				%xtraiter = and i32 %N, 3
				%tmp1 = icmp ult i32 %tmp, 3
				br i1 %tmp1, label %for.cond.cleanup.loopexit.unr-lcssa, label %for.body.preheader.new

				for.body.preheader.new: ; preds = %for.body.preheader
				%unroll_iter = sub i32 %N, %xtraiter
				br label %for.body

				for.cond.cleanup.loopexit.unr-lcssa: ; preds = %for.body, %for.body.preheader
				%i.010.unr = phi i32 [ 0, %for.body.preheader ], [ %inc.3, %for.body ]
				%lcmp.mod = icmp eq i32 %xtraiter, 0
				br i1 %lcmp.mod, label %for.cond.cleanup, label %for.body.epil

				for.body.epil: ; preds = %for.body.epil, %for.cond.cleanup.loopexit.unr-lcssa
				%i.010.epil = phi i32 [ %inc.epil, %for.body.epil ], [ %i.010.unr, %for.cond.cleanup.loopexit.unr-lcssa ]
				%epil.iter = phi i32 [ %epil.iter.sub, %for.body.epil ], [ %xtraiter, %for.cond.cleanup.loopexit.unr-lcssa ]
				%arrayidx.epil = getelementptr inbounds i16, i16* %A, i32 %i.010.epil
				%tmp2 = load i16, i16* %arrayidx.epil, align 2
				%conv.epil = sext i16 %tmp2 to i32
				%arrayidx1.epil = getelementptr inbounds i16, i16* %B, i32 %i.010.epil
				%tmp3 = load i16, i16* %arrayidx1.epil, align 2
				%conv2.epil = sext i16 %tmp3 to i32
				%mul.epil = mul nsw i32 %conv2.epil, %conv.epil
				%arrayidx3.epil = getelementptr inbounds i32, i32* %C, i32 %i.010.epil
				store i32 %mul.epil, i32* %arrayidx3.epil, align 4
				%inc.epil = add nuw i32 %i.010.epil, 1
				%epil.iter.sub = add i32 %epil.iter, -1
				%epil.iter.cmp = icmp eq i32 %epil.iter.sub, 0
				br i1 %epil.iter.cmp, label %for.cond.cleanup, label %for.body.epil

				for.cond.cleanup: ; preds = %for.body.epil, %for.cond.cleanup.loopexit.unr-lcssa, %entry
				ret void

				for.body: ; preds = %for.body, %for.body.preheader.new
				%i.010 = phi i32 [ 0, %for.body.preheader.new ], [ %inc.3, %for.body ]
				%niter = phi i32 [ %unroll_iter, %for.body.preheader.new ], [ %niter.nsub.3, %for.body ]
				%arrayidx = getelementptr inbounds i16, i16* %A, i32 %i.010
				%tmp4 = load i16, i16* %arrayidx, align 2
				%conv = sext i16 %tmp4 to i32
				%arrayidx1 = getelementptr inbounds i16, i16* %B, i32 %i.010
				%tmp5 = load i16, i16* %arrayidx1, align 2
				%conv2 = sext i16 %tmp5 to i32
				%mul = mul nsw i32 %conv2, %conv
				%arrayidx3 = getelementptr inbounds i32, i32* %C, i32 %i.010
				store i32 %mul, i32* %arrayidx3, align 4
				%inc = or i32 %i.010, 1
				%arrayidx.1 = getelementptr inbounds i16, i16* %A, i32 %inc
				%tmp6 = load i16, i16* %arrayidx.1, align 2
				%conv.1 = sext i16 %tmp6 to i32
				%arrayidx1.1 = getelementptr inbounds i16, i16* %B, i32 %inc
				%tmp7 = load i16, i16* %arrayidx1.1, align 2
				%conv2.1 = sext i16 %tmp7 to i32
				%mul.1 = mul nsw i32 %conv2.1, %conv.1
				%arrayidx3.1 = getelementptr inbounds i32, i32* %C, i32 %inc
				store i32 %mul.1, i32* %arrayidx3.1, align 4
				%inc.1 = or i32 %i.010, 2
				%arrayidx.2 = getelementptr inbounds i16, i16* %A, i32 %inc.1
				%tmp8 = load i16, i16* %arrayidx.2, align 2
				%conv.2 = sext i16 %tmp8 to i32
				%arrayidx1.2 = getelementptr inbounds i16, i16* %B, i32 %inc.1
				%tmp9 = load i16, i16* %arrayidx1.2, align 2
				%conv2.2 = sext i16 %tmp9 to i32
				%mul.2 = mul nsw i32 %conv2.2, %conv.2
				%arrayidx3.2 = getelementptr inbounds i32, i32* %C, i32 %inc.1
				store i32 %mul.2, i32* %arrayidx3.2, align 4
				%inc.2 = or i32 %i.010, 3
				%arrayidx.3 = getelementptr inbounds i16, i16* %A, i32 %inc.2
				%tmp10 = load i16, i16* %arrayidx.3, align 2
				%conv.3 = sext i16 %tmp10 to i32
				%arrayidx1.3 = getelementptr inbounds i16, i16* %B, i32 %inc.2
				%tmp11 = load i16, i16* %arrayidx1.3, align 2
				%conv2.3 = sext i16 %tmp11 to i32
				%mul.3 = mul nsw i32 %conv2.3, %conv.3
				%arrayidx3.3 = getelementptr inbounds i32, i32* %C, i32 %inc.2
				store i32 %mul.3, i32* %arrayidx3.3, align 4
				%inc.3 = add i32 %i.010, 4
				%niter.nsub.3 = add i32 %niter, -4
				%niter.ncmp.3 = icmp eq i32 %niter.nsub.3, 0
				br i1 %niter.ncmp.3, label %for.cond.cleanup.loopexit.unr-lcssa, label %for.body
				}

				; CHECK-LABEL: mul_8x8_2d
				; CHECK: @ %for.body4.us

				; CHECK-DEFAULT: ldr{{.*}}, #16]!
				; CHECK-DEFAULT: ldrb{{.*}}, #4]!

				; DISABLED-NOT: ldr{{.*}}]!
				; DISABLED-NOT: str{{.*}}]!

				; CHECK-T2: @ %for.body4.us.epil
				; CHECK-T2: ldrb{{.*}}, #1]!
				; CHECK-T2: ldr{{.*}}, #4]!

				define void @mul_8x8_2d(i8* nocapture readonly %A, i8** nocapture readonly %B, i32** nocapture readonly %C, i32 %N, i32 %M) {
				entry:
				%cmp24 = icmp eq i32 %N, 0
				%cmp222 = icmp eq i32 %M, 0
				%or.cond = or i1 %cmp24, %cmp222
				br i1 %or.cond, label %for.cond.cleanup, label %for.cond1.preheader.us.preheader

				for.cond1.preheader.us.preheader: ; preds = %entry
				%tmp = add i32 %M, -1
				%xtraiter = and i32 %M, 3
				%tmp1 = icmp ult i32 %tmp, 3
				%unroll_iter = sub i32 %M, %xtraiter
				%lcmp.mod = icmp eq i32 %xtraiter, 0
				br label %for.cond1.preheader.us

				for.cond1.preheader.us: ; preds = %for.cond1.for.cond.cleanup3_crit_edge.us, %for.cond1.preheader.us.preheader
				%i.025.us = phi i32 [ %inc11.us, %for.cond1.for.cond.cleanup3_crit_edge.us ], [ 0, %for.cond1.preheader.us.preheader ]
				%arrayidx.us = getelementptr inbounds i8, i8* %A, i32 %i.025.us
				%arrayidx5.us = getelementptr inbounds i8*, i8** %B, i32 %i.025.us
				%arrayidx8.us = getelementptr inbounds i32*, i32** %C, i32 %i.025.us
				%.pre = load i8*, i8** %arrayidx5.us, align 4
				%.pre30 = load i32*, i32** %arrayidx8.us, align 4
				br i1 %tmp1, label %for.cond1.for.cond.cleanup3_crit_edge.us.unr-lcssa, label %for.body4.us

				for.body4.us: ; preds = %for.body4.us, %for.cond1.preheader.us
				%j.023.us = phi i32 [ %inc.us.3, %for.body4.us ], [ 0, %for.cond1.preheader.us ]
				%niter = phi i32 [ %niter.nsub.3, %for.body4.us ], [ %unroll_iter, %for.cond1.preheader.us ]
				%tmp2 = load i8, i8* %arrayidx.us, align 1
				%conv.us = zext i8 %tmp2 to i32
				%arrayidx6.us = getelementptr inbounds i8, i8* %.pre, i32 %j.023.us
				%tmp3 = load i8, i8* %arrayidx6.us, align 1
				%conv7.us = zext i8 %tmp3 to i32
				%mul.us = mul nuw nsw i32 %conv7.us, %conv.us
				%arrayidx9.us = getelementptr inbounds i32, i32* %.pre30, i32 %j.023.us
				%tmp4 = load i32, i32* %arrayidx9.us, align 4
				%add.us = add nsw i32 %tmp4, %mul.us
				store i32 %add.us, i32* %arrayidx9.us, align 4
				%inc.us = or i32 %j.023.us, 1
				%tmp5 = load i8, i8* %arrayidx.us, align 1
				%conv.us.1 = zext i8 %tmp5 to i32
				%arrayidx6.us.1 = getelementptr inbounds i8, i8* %.pre, i32 %inc.us
				%tmp6 = load i8, i8* %arrayidx6.us.1, align 1
				%conv7.us.1 = zext i8 %tmp6 to i32
				%mul.us.1 = mul nuw nsw i32 %conv7.us.1, %conv.us.1
				%arrayidx9.us.1 = getelementptr inbounds i32, i32* %.pre30, i32 %inc.us
				%tmp7 = load i32, i32* %arrayidx9.us.1, align 4
				%add.us.1 = add nsw i32 %tmp7, %mul.us.1
				store i32 %add.us.1, i32* %arrayidx9.us.1, align 4
				%inc.us.1 = or i32 %j.023.us, 2
				%tmp8 = load i8, i8* %arrayidx.us, align 1
				%conv.us.2 = zext i8 %tmp8 to i32
				%arrayidx6.us.2 = getelementptr inbounds i8, i8* %.pre, i32 %inc.us.1
				%tmp9 = load i8, i8* %arrayidx6.us.2, align 1
				%conv7.us.2 = zext i8 %tmp9 to i32
				%mul.us.2 = mul nuw nsw i32 %conv7.us.2, %conv.us.2
				%arrayidx9.us.2 = getelementptr inbounds i32, i32* %.pre30, i32 %inc.us.1
				%tmp10 = load i32, i32* %arrayidx9.us.2, align 4
				%add.us.2 = add nsw i32 %tmp10, %mul.us.2
				store i32 %add.us.2, i32* %arrayidx9.us.2, align 4
				%inc.us.2 = or i32 %j.023.us, 3
				%tmp11 = load i8, i8* %arrayidx.us, align 1
				%conv.us.3 = zext i8 %tmp11 to i32
				%arrayidx6.us.3 = getelementptr inbounds i8, i8* %.pre, i32 %inc.us.2
				%tmp12 = load i8, i8* %arrayidx6.us.3, align 1
				%conv7.us.3 = zext i8 %tmp12 to i32
				%mul.us.3 = mul nuw nsw i32 %conv7.us.3, %conv.us.3
				%arrayidx9.us.3 = getelementptr inbounds i32, i32* %.pre30, i32 %inc.us.2
				%tmp13 = load i32, i32* %arrayidx9.us.3, align 4
				%add.us.3 = add nsw i32 %tmp13, %mul.us.3
				store i32 %add.us.3, i32* %arrayidx9.us.3, align 4
				%inc.us.3 = add i32 %j.023.us, 4
				%niter.nsub.3 = add i32 %niter, -4
				%niter.ncmp.3 = icmp eq i32 %niter.nsub.3, 0
				br i1 %niter.ncmp.3, label %for.cond1.for.cond.cleanup3_crit_edge.us.unr-lcssa, label %for.body4.us

				for.cond1.for.cond.cleanup3_crit_edge.us.unr-lcssa: ; preds = %for.body4.us, %for.cond1.preheader.us
				%j.023.us.unr = phi i32 [ 0, %for.cond1.preheader.us ], [ %inc.us.3, %for.body4.us ]
				br i1 %lcmp.mod, label %for.cond1.for.cond.cleanup3_crit_edge.us, label %for.body4.us.epil

				for.body4.us.epil: ; preds = %for.body4.us.epil, %for.cond1.for.cond.cleanup3_crit_edge.us.unr-lcssa
				%j.023.us.epil = phi i32 [ %inc.us.epil, %for.body4.us.epil ], [ %j.023.us.unr, %for.cond1.for.cond.cleanup3_crit_edge.us.unr-lcssa ]
				%epil.iter = phi i32 [ %epil.iter.sub, %for.body4.us.epil ], [ %xtraiter, %for.cond1.for.cond.cleanup3_crit_edge.us.unr-lcssa ]
				%tmp14 = load i8, i8* %arrayidx.us, align 1
				%conv.us.epil = zext i8 %tmp14 to i32
				%arrayidx6.us.epil = getelementptr inbounds i8, i8* %.pre, i32 %j.023.us.epil
				%tmp15 = load i8, i8* %arrayidx6.us.epil, align 1
				%conv7.us.epil = zext i8 %tmp15 to i32
				%mul.us.epil = mul nuw nsw i32 %conv7.us.epil, %conv.us.epil
				%arrayidx9.us.epil = getelementptr inbounds i32, i32* %.pre30, i32 %j.023.us.epil
				%tmp16 = load i32, i32* %arrayidx9.us.epil, align 4
				%add.us.epil = add nsw i32 %tmp16, %mul.us.epil
				store i32 %add.us.epil, i32* %arrayidx9.us.epil, align 4
				%inc.us.epil = add nuw i32 %j.023.us.epil, 1
				%epil.iter.sub = add i32 %epil.iter, -1
				%epil.iter.cmp = icmp eq i32 %epil.iter.sub, 0
				br i1 %epil.iter.cmp, label %for.cond1.for.cond.cleanup3_crit_edge.us, label %for.body4.us.epil

				for.cond1.for.cond.cleanup3_crit_edge.us: ; preds = %for.body4.us.epil, %for.cond1.for.cond.cleanup3_crit_edge.us.unr-lcssa
				%inc11.us = add nuw i32 %i.025.us, 1
				%exitcond28 = icmp eq i32 %inc11.us, %N
				br i1 %exitcond28, label %for.cond.cleanup, label %for.cond1.preheader.us

				for.cond.cleanup: ; preds = %for.cond1.for.cond.cleanup3_crit_edge.us, %entry
				ret void
				}

				; CHECK-LABEL: mul_16x16_2d
				; CHECK: @ %for.body4.us

				; CHECK-DEFAULT: ldr{{.*}}, #16]!
				; CHECK-DEFAULT: ldrsh{{.*}}, #8]!

				; DISABLED-NOT: ldr{{.*}}]!
				; DISABLED-NOT: str{{.*}}]!

				; CHECK-T2: @ %for.body4.us.epil
				; CHECK-T2: ldrsh{{.*}}, #2]!
				; CHECK-T2: ldr{{.*}}, #4]!

				define void @mul_16x16_2d(i16* nocapture readonly %A, i16** nocapture readonly %B, i32** nocapture readonly %C, i32 %N, i32 %M) {
				entry:
				%cmp24 = icmp eq i32 %N, 0
				%cmp222 = icmp eq i32 %M, 0
				%or.cond = or i1 %cmp24, %cmp222
				br i1 %or.cond, label %for.cond.cleanup, label %for.cond1.preheader.us.preheader

				for.cond1.preheader.us.preheader: ; preds = %entry
				%tmp = add i32 %M, -1
				%xtraiter = and i32 %M, 3
				%tmp1 = icmp ult i32 %tmp, 3
				%unroll_iter = sub i32 %M, %xtraiter
				%lcmp.mod = icmp eq i32 %xtraiter, 0
				br label %for.cond1.preheader.us

				for.cond1.preheader.us: ; preds = %for.cond1.for.cond.cleanup3_crit_edge.us, %for.cond1.preheader.us.preheader
				%i.025.us = phi i32 [ %inc11.us, %for.cond1.for.cond.cleanup3_crit_edge.us ], [ 0, %for.cond1.preheader.us.preheader ]
				%arrayidx.us = getelementptr inbounds i16, i16* %A, i32 %i.025.us
				%tmp2 = load i16, i16* %arrayidx.us, align 2
				%conv.us = sext i16 %tmp2 to i32
				%arrayidx5.us = getelementptr inbounds i16*, i16** %B, i32 %i.025.us
				%tmp3 = load i16*, i16** %arrayidx5.us, align 4
				%arrayidx8.us = getelementptr inbounds i32*, i32** %C, i32 %i.025.us
				%tmp4 = load i32*, i32** %arrayidx8.us, align 4
				br i1 %tmp1, label %for.cond1.for.cond.cleanup3_crit_edge.us.unr-lcssa, label %for.body4.us

				for.body4.us: ; preds = %for.body4.us, %for.cond1.preheader.us
				%j.023.us = phi i32 [ %inc.us.3, %for.body4.us ], [ 0, %for.cond1.preheader.us ]
				%niter = phi i32 [ %niter.nsub.3, %for.body4.us ], [ %unroll_iter, %for.cond1.preheader.us ]
				%arrayidx6.us = getelementptr inbounds i16, i16* %tmp3, i32 %j.023.us
				%tmp5 = load i16, i16* %arrayidx6.us, align 2
				%conv7.us = sext i16 %tmp5 to i32
				%mul.us = mul nsw i32 %conv7.us, %conv.us
				%arrayidx9.us = getelementptr inbounds i32, i32* %tmp4, i32 %j.023.us
				%tmp6 = load i32, i32* %arrayidx9.us, align 4
				%add.us = add nsw i32 %tmp6, %mul.us
				store i32 %add.us, i32* %arrayidx9.us, align 4
				%inc.us = or i32 %j.023.us, 1
				%arrayidx6.us.1 = getelementptr inbounds i16, i16* %tmp3, i32 %inc.us
				%tmp7 = load i16, i16* %arrayidx6.us.1, align 2
				%conv7.us.1 = sext i16 %tmp7 to i32
				%mul.us.1 = mul nsw i32 %conv7.us.1, %conv.us
				%arrayidx9.us.1 = getelementptr inbounds i32, i32* %tmp4, i32 %inc.us
				%tmp8 = load i32, i32* %arrayidx9.us.1, align 4
				%add.us.1 = add nsw i32 %tmp8, %mul.us.1
				store i32 %add.us.1, i32* %arrayidx9.us.1, align 4
				%inc.us.1 = or i32 %j.023.us, 2
				%arrayidx6.us.2 = getelementptr inbounds i16, i16* %tmp3, i32 %inc.us.1
				%tmp9 = load i16, i16* %arrayidx6.us.2, align 2
				%conv7.us.2 = sext i16 %tmp9 to i32
				%mul.us.2 = mul nsw i32 %conv7.us.2, %conv.us
				%arrayidx9.us.2 = getelementptr inbounds i32, i32* %tmp4, i32 %inc.us.1
				%tmp10 = load i32, i32* %arrayidx9.us.2, align 4
				%add.us.2 = add nsw i32 %tmp10, %mul.us.2
				store i32 %add.us.2, i32* %arrayidx9.us.2, align 4
				%inc.us.2 = or i32 %j.023.us, 3
				%arrayidx6.us.3 = getelementptr inbounds i16, i16* %tmp3, i32 %inc.us.2
				%tmp11 = load i16, i16* %arrayidx6.us.3, align 2
				%conv7.us.3 = sext i16 %tmp11 to i32
				%mul.us.3 = mul nsw i32 %conv7.us.3, %conv.us
				%arrayidx9.us.3 = getelementptr inbounds i32, i32* %tmp4, i32 %inc.us.2
				%tmp12 = load i32, i32* %arrayidx9.us.3, align 4
				%add.us.3 = add nsw i32 %tmp12, %mul.us.3
				store i32 %add.us.3, i32* %arrayidx9.us.3, align 4
				%inc.us.3 = add i32 %j.023.us, 4
				%niter.nsub.3 = add i32 %niter, -4
				%niter.ncmp.3 = icmp eq i32 %niter.nsub.3, 0
				br i1 %niter.ncmp.3, label %for.cond1.for.cond.cleanup3_crit_edge.us.unr-lcssa, label %for.body4.us

				for.cond1.for.cond.cleanup3_crit_edge.us.unr-lcssa: ; preds = %for.body4.us, %for.cond1.preheader.us
				%j.023.us.unr = phi i32 [ 0, %for.cond1.preheader.us ], [ %inc.us.3, %for.body4.us ]
				br i1 %lcmp.mod, label %for.cond1.for.cond.cleanup3_crit_edge.us, label %for.body4.us.epil

				for.body4.us.epil: ; preds = %for.body4.us.epil, %for.cond1.for.cond.cleanup3_crit_edge.us.unr-lcssa
				%j.023.us.epil = phi i32 [ %inc.us.epil, %for.body4.us.epil ], [ %j.023.us.unr, %for.cond1.for.cond.cleanup3_crit_edge.us.unr-lcssa ]
				%epil.iter = phi i32 [ %epil.iter.sub, %for.body4.us.epil ], [ %xtraiter, %for.cond1.for.cond.cleanup3_crit_edge.us.unr-lcssa ]
				%arrayidx6.us.epil = getelementptr inbounds i16, i16* %tmp3, i32 %j.023.us.epil
				%tmp13 = load i16, i16* %arrayidx6.us.epil, align 2
				%conv7.us.epil = sext i16 %tmp13 to i32
				%mul.us.epil = mul nsw i32 %conv7.us.epil, %conv.us
				%arrayidx9.us.epil = getelementptr inbounds i32, i32* %tmp4, i32 %j.023.us.epil
				%tmp14 = load i32, i32* %arrayidx9.us.epil, align 4
				%add.us.epil = add nsw i32 %tmp14, %mul.us.epil
				store i32 %add.us.epil, i32* %arrayidx9.us.epil, align 4
				%inc.us.epil = add nuw i32 %j.023.us.epil, 1
				%epil.iter.sub = add i32 %epil.iter, -1
				%epil.iter.cmp = icmp eq i32 %epil.iter.sub, 0
				br i1 %epil.iter.cmp, label %for.cond1.for.cond.cleanup3_crit_edge.us, label %for.body4.us.epil

				for.cond1.for.cond.cleanup3_crit_edge.us: ; preds = %for.body4.us.epil, %for.cond1.for.cond.cleanup3_crit_edge.us.unr-lcssa
				%inc11.us = add nuw i32 %i.025.us, 1
				%exitcond28 = icmp eq i32 %inc11.us, %N
				br i1 %exitcond28, label %for.cond.cleanup, label %for.cond1.preheader.us

				for.cond.cleanup: ; preds = %for.cond1.for.cond.cleanup3_crit_edge.us, %entry
				ret void
				}

				; CHECK-LABEL: mac_8x8_2d
				; CHECK: @ %for.body4.us

				; CHECK-BASE: ldrb{{.*}}
				; CHECK-BASE: ldrb{{.*}}, #3]
				; CHECK-BASE: str{{.*}}, lsl #2]
				; CHECK-BASE: ldrb{{.*}}
				; CHECK-BASE: ldrb{{.*}}, #4]!
				; CHECK-BASE: str{{.*}}, lsl #2]
				; CHECK-BASE: ldrb{{.*}}
				; CHECK-BASE: ldrb{{.*}}, #1]
				; CHECK-BASE: str{{.*}}, lsl #2]
				; CHECK-BASE: ldrb{{.*}}
				; CHECK-BASE: ldrb{{.*}}, #2]
				; CHECK-BASE: str{{.*}}, lsl #2]

				; CHECK-COMPLEX: ldrb{{.*}}
				; CHECK-COMPLEX: ldrb{{.*}}
				; CHECK-COMPLEX: str{{.*}}, lsl #2]
				; CHECK-COMPLEX: ldrb{{.*}}
				; CHECK-COMPLEX: ldrb{{.*}}, #1]
				; CHECK-COMPLEX: str{{.*}}, lsl #2]
				; CHECK-COMPLEX: ldrb{{.*}}
				; CHECK-COMPLEX: ldrb{{.*}}, #2]
				; CHECK-COMPLEX: str{{.*}}, lsl #2]
				; CHECK-COMPLEX: ldrb{{.*}}
				; CHECK-COMPLEX: ldrb{{.*}}, #3]
				; CHECK-COMPLEX: str{{.*}}, lsl #2]

				; DISABLED-NOT: ldr{{.*}}]!
				; DISABLED-NOT: str{{.*}}]!

				; CHECK-T2: @ %for.body4.us.epil
				; CHECK-T2: ldrb{{.*}}, #1]!

				define void @mac_8x8_2d(i8* nocapture readonly %A, i8** nocapture readonly %B, i32* nocapture %C, i32 %N, i32 %M) {
				entry:
				%cmp22 = icmp eq i32 %N, 0
				%cmp220 = icmp eq i32 %M, 0
				%or.cond = or i1 %cmp22, %cmp220
				br i1 %or.cond, label %for.cond.cleanup, label %for.cond1.preheader.us.preheader

				for.cond1.preheader.us.preheader: ; preds = %entry
				%tmp = add i32 %M, -1
				%xtraiter = and i32 %M, 3
				%tmp1 = icmp ult i32 %tmp, 3
				%unroll_iter = sub i32 %M, %xtraiter
				%lcmp.mod = icmp eq i32 %xtraiter, 0
				br label %for.cond1.preheader.us

				for.cond1.preheader.us: ; preds = %for.cond1.for.cond.cleanup3_crit_edge.us, %for.cond1.preheader.us.preheader
				%i.023.us = phi i32 [ %inc10.us, %for.cond1.for.cond.cleanup3_crit_edge.us ], [ 0, %for.cond1.preheader.us.preheader ]
				%arrayidx.us = getelementptr inbounds i8, i8* %A, i32 %i.023.us
				%arrayidx5.us = getelementptr inbounds i8*, i8** %B, i32 %i.023.us
				%arrayidx8.us = getelementptr inbounds i32, i32* %C, i32 %i.023.us
				%.pre = load i8*, i8** %arrayidx5.us, align 4
				%.pre28 = load i32, i32* %arrayidx8.us, align 4
				br i1 %tmp1, label %for.cond1.for.cond.cleanup3_crit_edge.us.unr-lcssa, label %for.body4.us

				for.body4.us: ; preds = %for.body4.us, %for.cond1.preheader.us
				%tmp2 = phi i32 [ %add.us.3, %for.body4.us ], [ %.pre28, %for.cond1.preheader.us ]
				%j.021.us = phi i32 [ %inc.us.3, %for.body4.us ], [ 0, %for.cond1.preheader.us ]
				%niter = phi i32 [ %niter.nsub.3, %for.body4.us ], [ %unroll_iter, %for.cond1.preheader.us ]
				%tmp3 = load i8, i8* %arrayidx.us, align 1
				%conv.us = zext i8 %tmp3 to i32
				%arrayidx6.us = getelementptr inbounds i8, i8* %.pre, i32 %j.021.us
				%tmp4 = load i8, i8* %arrayidx6.us, align 1
				%conv7.us = zext i8 %tmp4 to i32
				%mul.us = mul nuw nsw i32 %conv7.us, %conv.us
				%add.us = add nsw i32 %mul.us, %tmp2
				store i32 %add.us, i32* %arrayidx8.us, align 4
				%inc.us = or i32 %j.021.us, 1
				%tmp5 = load i8, i8* %arrayidx.us, align 1
				%conv.us.1 = zext i8 %tmp5 to i32
				%arrayidx6.us.1 = getelementptr inbounds i8, i8* %.pre, i32 %inc.us
				%tmp6 = load i8, i8* %arrayidx6.us.1, align 1
				%conv7.us.1 = zext i8 %tmp6 to i32
				%mul.us.1 = mul nuw nsw i32 %conv7.us.1, %conv.us.1
				%add.us.1 = add nsw i32 %mul.us.1, %add.us
				store i32 %add.us.1, i32* %arrayidx8.us, align 4
				%inc.us.1 = or i32 %j.021.us, 2
				%tmp7 = load i8, i8* %arrayidx.us, align 1
				%conv.us.2 = zext i8 %tmp7 to i32
				%arrayidx6.us.2 = getelementptr inbounds i8, i8* %.pre, i32 %inc.us.1
				%tmp8 = load i8, i8* %arrayidx6.us.2, align 1
				%conv7.us.2 = zext i8 %tmp8 to i32
				%mul.us.2 = mul nuw nsw i32 %conv7.us.2, %conv.us.2
				%add.us.2 = add nsw i32 %mul.us.2, %add.us.1
				store i32 %add.us.2, i32* %arrayidx8.us, align 4
				%inc.us.2 = or i32 %j.021.us, 3
				%tmp9 = load i8, i8* %arrayidx.us, align 1
				%conv.us.3 = zext i8 %tmp9 to i32
				%arrayidx6.us.3 = getelementptr inbounds i8, i8* %.pre, i32 %inc.us.2
				%tmp10 = load i8, i8* %arrayidx6.us.3, align 1
				%conv7.us.3 = zext i8 %tmp10 to i32
				%mul.us.3 = mul nuw nsw i32 %conv7.us.3, %conv.us.3
				%add.us.3 = add nsw i32 %mul.us.3, %add.us.2
				store i32 %add.us.3, i32* %arrayidx8.us, align 4
				%inc.us.3 = add i32 %j.021.us, 4
				%niter.nsub.3 = add i32 %niter, -4
				%niter.ncmp.3 = icmp eq i32 %niter.nsub.3, 0
				br i1 %niter.ncmp.3, label %for.cond1.for.cond.cleanup3_crit_edge.us.unr-lcssa, label %for.body4.us

				for.cond1.for.cond.cleanup3_crit_edge.us.unr-lcssa: ; preds = %for.body4.us, %for.cond1.preheader.us
				%.unr = phi i32 [ %.pre28, %for.cond1.preheader.us ], [ %add.us.3, %for.body4.us ]
				%j.021.us.unr = phi i32 [ 0, %for.cond1.preheader.us ], [ %inc.us.3, %for.body4.us ]
				br i1 %lcmp.mod, label %for.cond1.for.cond.cleanup3_crit_edge.us, label %for.body4.us.epil

				for.body4.us.epil: ; preds = %for.body4.us.epil, %for.cond1.for.cond.cleanup3_crit_edge.us.unr-lcssa
				%tmp11 = phi i32 [ %add.us.epil, %for.body4.us.epil ], [ %.unr, %for.cond1.for.cond.cleanup3_crit_edge.us.unr-lcssa ]
				%j.021.us.epil = phi i32 [ %inc.us.epil, %for.body4.us.epil ], [ %j.021.us.unr, %for.cond1.for.cond.cleanup3_crit_edge.us.unr-lcssa ]
				%epil.iter = phi i32 [ %epil.iter.sub, %for.body4.us.epil ], [ %xtraiter, %for.cond1.for.cond.cleanup3_crit_edge.us.unr-lcssa ]
				%tmp12 = load i8, i8* %arrayidx.us, align 1
				%conv.us.epil = zext i8 %tmp12 to i32
				%arrayidx6.us.epil = getelementptr inbounds i8, i8* %.pre, i32 %j.021.us.epil
				%tmp13 = load i8, i8* %arrayidx6.us.epil, align 1
				%conv7.us.epil = zext i8 %tmp13 to i32
				%mul.us.epil = mul nuw nsw i32 %conv7.us.epil, %conv.us.epil
				%add.us.epil = add nsw i32 %mul.us.epil, %tmp11
				store i32 %add.us.epil, i32* %arrayidx8.us, align 4
				%inc.us.epil = add nuw i32 %j.021.us.epil, 1
				%epil.iter.sub = add i32 %epil.iter, -1
				%epil.iter.cmp = icmp eq i32 %epil.iter.sub, 0
				br i1 %epil.iter.cmp, label %for.cond1.for.cond.cleanup3_crit_edge.us, label %for.body4.us.epil

				for.cond1.for.cond.cleanup3_crit_edge.us: ; preds = %for.body4.us.epil, %for.cond1.for.cond.cleanup3_crit_edge.us.unr-lcssa
				%inc10.us = add nuw i32 %i.023.us, 1
				%exitcond26 = icmp eq i32 %inc10.us, %N
				br i1 %exitcond26, label %for.cond.cleanup, label %for.cond1.preheader.us

				for.cond.cleanup: ; preds = %for.cond1.for.cond.cleanup3_crit_edge.us, %entry
				ret void
				}

				; CHECK-LABEL: mac_16x16_2d
				; CHECK: @ %for.body4.us

				; CHECK-BASE: ldrsh{{.*}}, #8]!
				; CHECK-BASE: ldrsh{{.*}}, #2]
				; CHECK-BASE: ldrsh{{.*}}, #4]
				; CHECK-BASE: ldrsh{{.*}}, #6]

				; CHECK-COMPLEX: ldrsh{{.*}}, lsl #1]
				; CHECK-COMPLEX: ldrsh{{.*}}, #2]
				; CHECK-COMPLEX: ldrsh{{.*}}, #4]
				; CHECK-COMPLEX: ldrsh{{.*}}, #6]

				; DISABLED-NOT: ldr{{.*}}]!

				; CHECK-T2: @ %for.body4.us.epil
				; CHECK-T2: ldrsh{{.*}}, #2]!

				define void @mac_16x16_2d(i16* nocapture readonly %A, i16** nocapture readonly %B, i32* nocapture %C, i32 %N, i32 %M) {
				entry:
				%cmp23 = icmp eq i32 %N, 0
				%cmp220 = icmp eq i32 %M, 0
				%or.cond = or i1 %cmp23, %cmp220
				br i1 %or.cond, label %for.cond.cleanup, label %for.cond1.preheader.us.preheader

				for.cond1.preheader.us.preheader: ; preds = %entry
				%tmp = add i32 %M, -1
				%xtraiter = and i32 %M, 3
				%tmp1 = icmp ult i32 %tmp, 3
				%unroll_iter = sub i32 %M, %xtraiter
				%lcmp.mod = icmp eq i32 %xtraiter, 0
				br label %for.cond1.preheader.us

				for.cond1.preheader.us: ; preds = %for.cond1.for.cond.cleanup3_crit_edge.us, %for.cond1.preheader.us.preheader
				%i.024.us = phi i32 [ %inc10.us, %for.cond1.for.cond.cleanup3_crit_edge.us ], [ 0, %for.cond1.preheader.us.preheader ]
				%arrayidx.us = getelementptr inbounds i16, i16* %A, i32 %i.024.us
				%tmp2 = load i16, i16* %arrayidx.us, align 2
				%conv.us = sext i16 %tmp2 to i32
				%arrayidx5.us = getelementptr inbounds i16*, i16** %B, i32 %i.024.us
				%tmp3 = load i16*, i16** %arrayidx5.us, align 4
				%arrayidx8.us = getelementptr inbounds i32, i32* %C, i32 %i.024.us
				%arrayidx8.promoted.us = load i32, i32* %arrayidx8.us, align 4
				br i1 %tmp1, label %for.cond1.for.cond.cleanup3_crit_edge.us.unr-lcssa, label %for.body4.us

				for.body4.us: ; preds = %for.body4.us, %for.cond1.preheader.us
				%add22.us = phi i32 [ %add.us.3, %for.body4.us ], [ %arrayidx8.promoted.us, %for.cond1.preheader.us ]
				%j.021.us = phi i32 [ %inc.us.3, %for.body4.us ], [ 0, %for.cond1.preheader.us ]
				%niter = phi i32 [ %niter.nsub.3, %for.body4.us ], [ %unroll_iter, %for.cond1.preheader.us ]
				%arrayidx6.us = getelementptr inbounds i16, i16* %tmp3, i32 %j.021.us
				%tmp4 = load i16, i16* %arrayidx6.us, align 2
				%conv7.us = sext i16 %tmp4 to i32
				%mul.us = mul nsw i32 %conv7.us, %conv.us
				%add.us = add nsw i32 %mul.us, %add22.us
				%inc.us = or i32 %j.021.us, 1
				%arrayidx6.us.1 = getelementptr inbounds i16, i16* %tmp3, i32 %inc.us
				%tmp5 = load i16, i16* %arrayidx6.us.1, align 2
				%conv7.us.1 = sext i16 %tmp5 to i32
				%mul.us.1 = mul nsw i32 %conv7.us.1, %conv.us
				%add.us.1 = add nsw i32 %mul.us.1, %add.us
				%inc.us.1 = or i32 %j.021.us, 2
				%arrayidx6.us.2 = getelementptr inbounds i16, i16* %tmp3, i32 %inc.us.1
				%tmp6 = load i16, i16* %arrayidx6.us.2, align 2
				%conv7.us.2 = sext i16 %tmp6 to i32
				%mul.us.2 = mul nsw i32 %conv7.us.2, %conv.us
				%add.us.2 = add nsw i32 %mul.us.2, %add.us.1
				%inc.us.2 = or i32 %j.021.us, 3
				%arrayidx6.us.3 = getelementptr inbounds i16, i16* %tmp3, i32 %inc.us.2
				%tmp7 = load i16, i16* %arrayidx6.us.3, align 2
				%conv7.us.3 = sext i16 %tmp7 to i32
				%mul.us.3 = mul nsw i32 %conv7.us.3, %conv.us
				%add.us.3 = add nsw i32 %mul.us.3, %add.us.2
				%inc.us.3 = add i32 %j.021.us, 4
				%niter.nsub.3 = add i32 %niter, -4
				%niter.ncmp.3 = icmp eq i32 %niter.nsub.3, 0
				br i1 %niter.ncmp.3, label %for.cond1.for.cond.cleanup3_crit_edge.us.unr-lcssa, label %for.body4.us

				for.cond1.for.cond.cleanup3_crit_edge.us.unr-lcssa: ; preds = %for.body4.us, %for.cond1.preheader.us
				%add.us.lcssa.ph = phi i32 [ undef, %for.cond1.preheader.us ], [ %add.us.3, %for.body4.us ]
				%add22.us.unr = phi i32 [ %arrayidx8.promoted.us, %for.cond1.preheader.us ], [ %add.us.3, %for.body4.us ]
				%j.021.us.unr = phi i32 [ 0, %for.cond1.preheader.us ], [ %inc.us.3, %for.body4.us ]
				br i1 %lcmp.mod, label %for.cond1.for.cond.cleanup3_crit_edge.us, label %for.body4.us.epil

				for.body4.us.epil: ; preds = %for.body4.us.epil, %for.cond1.for.cond.cleanup3_crit_edge.us.unr-lcssa
				%add22.us.epil = phi i32 [ %add.us.epil, %for.body4.us.epil ], [ %add22.us.unr, %for.cond1.for.cond.cleanup3_crit_edge.us.unr-lcssa ]
				%j.021.us.epil = phi i32 [ %inc.us.epil, %for.body4.us.epil ], [ %j.021.us.unr, %for.cond1.for.cond.cleanup3_crit_edge.us.unr-lcssa ]
				%epil.iter = phi i32 [ %epil.iter.sub, %for.body4.us.epil ], [ %xtraiter, %for.cond1.for.cond.cleanup3_crit_edge.us.unr-lcssa ]
				%arrayidx6.us.epil = getelementptr inbounds i16, i16* %tmp3, i32 %j.021.us.epil
				%tmp8 = load i16, i16* %arrayidx6.us.epil, align 2
				%conv7.us.epil = sext i16 %tmp8 to i32
				%mul.us.epil = mul nsw i32 %conv7.us.epil, %conv.us
				%add.us.epil = add nsw i32 %mul.us.epil, %add22.us.epil
				%inc.us.epil = add nuw i32 %j.021.us.epil, 1
				%epil.iter.sub = add i32 %epil.iter, -1
				%epil.iter.cmp = icmp eq i32 %epil.iter.sub, 0
				br i1 %epil.iter.cmp, label %for.cond1.for.cond.cleanup3_crit_edge.us, label %for.body4.us.epil

				for.cond1.for.cond.cleanup3_crit_edge.us: ; preds = %for.body4.us.epil, %for.cond1.for.cond.cleanup3_crit_edge.us.unr-lcssa
				%add.us.lcssa = phi i32 [ %add.us.lcssa.ph, %for.cond1.for.cond.cleanup3_crit_edge.us.unr-lcssa ], [ %add.us.epil, %for.body4.us.epil ]
				store i32 %add.us.lcssa, i32* %arrayidx8.us, align 4
				%inc10.us = add nuw i32 %i.024.us, 1
				%exitcond27 = icmp eq i32 %inc10.us, %N
				br i1 %exitcond27, label %for.cond.cleanup, label %for.cond1.preheader.us

				for.cond.cleanup: ; preds = %for.cond1.for.cond.cleanup3_crit_edge.us, %entry
				ret void
				}

				; CHECK-LABEL: mul32x32_backwards
				; CHECK: @ %for.body

				; TODO: post increments for decreasing addresses
				; CHECK-DEFAULT-NOT: ldr{{.*}}]!
				; CHECK-DEFAULT-NOT: str{{.*}}]!

				; CHECK-COMPLEX-NOT: ldr{{.*}}]!
				; CHECK-COMPLEX-NOT: str{{.*}}]!

				define void @mul32x32_backwards(i32* nocapture %a, i32* nocapture readonly %b, i32* nocapture readonly %c, i32 %N) {
				entry:
				%i.08 = add i32 %N, -1
				%cmp9 = icmp sgt i32 %i.08, -1
				br i1 %cmp9, label %for.body.preheader, label %for.cond.cleanup

				for.body.preheader: ; preds = %entry
				%xtraiter = and i32 %N, 3
				%lcmp.mod = icmp eq i32 %xtraiter, 0
				br i1 %lcmp.mod, label %for.body.prol.loopexit, label %for.body.prol

				for.body.prol: ; preds = %for.body.prol, %for.body.preheader
				%i.010.prol = phi i32 [ %i.0.prol, %for.body.prol ], [ %i.08, %for.body.preheader ]
				%prol.iter = phi i32 [ %prol.iter.sub, %for.body.prol ], [ %xtraiter, %for.body.preheader ]
				%arrayidx.prol = getelementptr inbounds i32, i32* %b, i32 %i.010.prol
				%tmp = load i32, i32* %arrayidx.prol, align 4
				%arrayidx1.prol = getelementptr inbounds i32, i32* %c, i32 %i.010.prol
				%tmp1 = load i32, i32* %arrayidx1.prol, align 4
				%mul.prol = mul nsw i32 %tmp1, %tmp
				%arrayidx2.prol = getelementptr inbounds i32, i32* %a, i32 %i.010.prol
				store i32 %mul.prol, i32* %arrayidx2.prol, align 4
				%i.0.prol = add i32 %i.010.prol, -1
				%prol.iter.sub = add i32 %prol.iter, -1
				%prol.iter.cmp = icmp eq i32 %prol.iter.sub, 0
				br i1 %prol.iter.cmp, label %for.body.prol.loopexit, label %for.body.prol

				for.body.prol.loopexit: ; preds = %for.body.prol, %for.body.preheader
				%i.010.unr = phi i32 [ %i.08, %for.body.preheader ], [ %i.0.prol, %for.body.prol ]
				%tmp2 = icmp ult i32 %i.08, 3
				br i1 %tmp2, label %for.cond.cleanup, label %for.body

				for.cond.cleanup: ; preds = %for.body, %for.body.prol.loopexit, %entry
				ret void

				for.body: ; preds = %for.body, %for.body.prol.loopexit
				%i.010 = phi i32 [ %i.0.3, %for.body ], [ %i.010.unr, %for.body.prol.loopexit ]
				%arrayidx = getelementptr inbounds i32, i32* %b, i32 %i.010
				%tmp3 = load i32, i32* %arrayidx, align 4
				%arrayidx1 = getelementptr inbounds i32, i32* %c, i32 %i.010
				%tmp4 = load i32, i32* %arrayidx1, align 4
				%mul = mul nsw i32 %tmp4, %tmp3
				%arrayidx2 = getelementptr inbounds i32, i32* %a, i32 %i.010
				store i32 %mul, i32* %arrayidx2, align 4
				%i.0 = add i32 %i.010, -1
				%arrayidx.1 = getelementptr inbounds i32, i32* %b, i32 %i.0
				%tmp5 = load i32, i32* %arrayidx.1, align 4
				%arrayidx1.1 = getelementptr inbounds i32, i32* %c, i32 %i.0
				%tmp6 = load i32, i32* %arrayidx1.1, align 4
				%mul.1 = mul nsw i32 %tmp6, %tmp5
				%arrayidx2.1 = getelementptr inbounds i32, i32* %a, i32 %i.0
				store i32 %mul.1, i32* %arrayidx2.1, align 4
				%i.0.1 = add i32 %i.010, -2
				%arrayidx.2 = getelementptr inbounds i32, i32* %b, i32 %i.0.1
				%tmp7 = load i32, i32* %arrayidx.2, align 4
				%arrayidx1.2 = getelementptr inbounds i32, i32* %c, i32 %i.0.1
				%tmp8 = load i32, i32* %arrayidx1.2, align 4
				%mul.2 = mul nsw i32 %tmp8, %tmp7
				%arrayidx2.2 = getelementptr inbounds i32, i32* %a, i32 %i.0.1
				store i32 %mul.2, i32* %arrayidx2.2, align 4
				%i.0.2 = add i32 %i.010, -3
				%arrayidx.3 = getelementptr inbounds i32, i32* %b, i32 %i.0.2
				%tmp9 = load i32, i32* %arrayidx.3, align 4
				%arrayidx1.3 = getelementptr inbounds i32, i32* %c, i32 %i.0.2
				%tmp10 = load i32, i32* %arrayidx1.3, align 4
				%mul.3 = mul nsw i32 %tmp10, %tmp9
				%arrayidx2.3 = getelementptr inbounds i32, i32* %a, i32 %i.0.2
				store i32 %mul.3, i32* %arrayidx2.3, align 4
				%i.0.3 = add i32 %i.010, -4
				%cmp.3 = icmp sgt i32 %i.0.3, -1
				br i1 %cmp.3, label %for.body, label %for.cond.cleanup
				}

				; CHECK-LABEL: mul32x32_forwards
				; CHECK: @ %for.body

				; CHECK-DEFAULT: ldr{{.*}}, #4]
				; CHECK-DEFAULT: ldr{{.*}}, #4]
				; CHECK-DEFAULT: str{{.*}}, #4]
				; CHECK-DEFAULT: ldr{{.*}}, #8]
				; CHECK-DEFAULT: ldr{{.*}}, #8]
				; CHECK-DEFAULT: str{{.*}}, #8]
				; CHECK-DEFAULT: ldr{{.*}}, #12]
				; CHECK-DEFAULT: ldr{{.*}}, #12]
				; CHECK-DEFAULT: str{{.*}}, #12]

				; CHECK-COMPLEX: ldr{{.*}}, #16]!
				; CHECK-COMPLEX: ldr{{.*}}, #16]!
				; CHECK-COMPLEX: str{{.*}}, #16]!

				; CHECK-T2: @ %for.body.epil
				; CHECK-T2: ldr{{.*}}, #4]!
				; CHECK-T2: ldr{{.*}}, #4]!
				; CHECK-T2: str{{.*}}, #4]!

				define void @mul32x32_forwards(i32* nocapture %a, i32* nocapture readonly %b, i32* nocapture readonly %c, i32 %N) {
				entry:
				%cmp8 = icmp eq i32 %N, 0
				br i1 %cmp8, label %for.cond.cleanup, label %for.body.preheader

				for.body.preheader: ; preds = %entry
				%tmp = add i32 %N, -1
				%xtraiter = and i32 %N, 3
				%tmp1 = icmp ult i32 %tmp, 3
				br i1 %tmp1, label %for.cond.cleanup.loopexit.unr-lcssa, label %for.body.preheader.new

				for.body.preheader.new: ; preds = %for.body.preheader
				%unroll_iter = sub i32 %N, %xtraiter
				br label %for.body

				for.cond.cleanup.loopexit.unr-lcssa: ; preds = %for.body, %for.body.preheader
				%i.09.unr = phi i32 [ 0, %for.body.preheader ], [ %inc.3, %for.body ]
				%lcmp.mod = icmp eq i32 %xtraiter, 0
				br i1 %lcmp.mod, label %for.cond.cleanup, label %for.body.epil

				for.body.epil: ; preds = %for.body.epil, %for.cond.cleanup.loopexit.unr-lcssa
				%i.09.epil = phi i32 [ %inc.epil, %for.body.epil ], [ %i.09.unr, %for.cond.cleanup.loopexit.unr-lcssa ]
				%epil.iter = phi i32 [ %epil.iter.sub, %for.body.epil ], [ %xtraiter, %for.cond.cleanup.loopexit.unr-lcssa ]
				%arrayidx.epil = getelementptr inbounds i32, i32* %b, i32 %i.09.epil
				%tmp2 = load i32, i32* %arrayidx.epil, align 4
				%arrayidx1.epil = getelementptr inbounds i32, i32* %c, i32 %i.09.epil
				%tmp3 = load i32, i32* %arrayidx1.epil, align 4
				%mul.epil = mul nsw i32 %tmp3, %tmp2
				%arrayidx2.epil = getelementptr inbounds i32, i32* %a, i32 %i.09.epil
				store i32 %mul.epil, i32* %arrayidx2.epil, align 4
				%inc.epil = add nuw nsw i32 %i.09.epil, 1
				%epil.iter.sub = add i32 %epil.iter, -1
				%epil.iter.cmp = icmp eq i32 %epil.iter.sub, 0
				br i1 %epil.iter.cmp, label %for.cond.cleanup, label %for.body.epil

				for.cond.cleanup: ; preds = %for.body.epil, %for.cond.cleanup.loopexit.unr-lcssa, %entry
				ret void

				for.body: ; preds = %for.body, %for.body.preheader.new
				%i.09 = phi i32 [ 0, %for.body.preheader.new ], [ %inc.3, %for.body ]
				%niter = phi i32 [ %unroll_iter, %for.body.preheader.new ], [ %niter.nsub.3, %for.body ]
				%arrayidx = getelementptr inbounds i32, i32* %b, i32 %i.09
				%tmp4 = load i32, i32* %arrayidx, align 4
				%arrayidx1 = getelementptr inbounds i32, i32* %c, i32 %i.09
				%tmp5 = load i32, i32* %arrayidx1, align 4
				%mul = mul nsw i32 %tmp5, %tmp4
				%arrayidx2 = getelementptr inbounds i32, i32* %a, i32 %i.09
				store i32 %mul, i32* %arrayidx2, align 4
				%inc = or i32 %i.09, 1
				%arrayidx.1 = getelementptr inbounds i32, i32* %b, i32 %inc
				%tmp6 = load i32, i32* %arrayidx.1, align 4
				%arrayidx1.1 = getelementptr inbounds i32, i32* %c, i32 %inc
				%tmp7 = load i32, i32* %arrayidx1.1, align 4
				%mul.1 = mul nsw i32 %tmp7, %tmp6
				%arrayidx2.1 = getelementptr inbounds i32, i32* %a, i32 %inc
				store i32 %mul.1, i32* %arrayidx2.1, align 4
				%inc.1 = or i32 %i.09, 2
				%arrayidx.2 = getelementptr inbounds i32, i32* %b, i32 %inc.1
				%tmp8 = load i32, i32* %arrayidx.2, align 4
				%arrayidx1.2 = getelementptr inbounds i32, i32* %c, i32 %inc.1
				%tmp9 = load i32, i32* %arrayidx1.2, align 4
				%mul.2 = mul nsw i32 %tmp9, %tmp8
				%arrayidx2.2 = getelementptr inbounds i32, i32* %a, i32 %inc.1
				store i32 %mul.2, i32* %arrayidx2.2, align 4
				%inc.2 = or i32 %i.09, 3
				%arrayidx.3 = getelementptr inbounds i32, i32* %b, i32 %inc.2
				%tmp10 = load i32, i32* %arrayidx.3, align 4
				%arrayidx1.3 = getelementptr inbounds i32, i32* %c, i32 %inc.2
				%tmp11 = load i32, i32* %arrayidx1.3, align 4
				%mul.3 = mul nsw i32 %tmp11, %tmp10
				%arrayidx2.3 = getelementptr inbounds i32, i32* %a, i32 %inc.2
				store i32 %mul.3, i32* %arrayidx2.3, align 4
				%inc.3 = add nuw nsw i32 %i.09, 4
				%niter.nsub.3 = add i32 %niter, -4
				%niter.ncmp.3 = icmp eq i32 %niter.nsub.3, 0
				br i1 %niter.ncmp.3, label %for.cond.cleanup.loopexit.unr-lcssa, label %for.body
				}

llvm/trunk/test/Transforms/LoopStrengthReduce/ARM/complexity.ll

	target datalayout = "e-m:e-p:32:32-i64:64-v128:64:128-a:0:32-n32-S64"			target datalayout = "e-m:e-p:32:32-i64:64-v128:64:128-a:0:32-n32-S64"

	; RUN: opt -mtriple=thumbv7em %s -S -loop-reduce -lsr-complexity-limit=65536 -o - | FileCheck %s --check-prefix=CHECK-DEFAULT			; RUN: opt -mtriple=thumbv7em %s -S -loop-reduce -lsr-complexity-limit=65536 -o - | FileCheck %s
	; RUN: opt -mtriple=thumbv7em %s -S -loop-reduce -lsr-complexity-limit=2147483647 -o - | FileCheck %s --check-prefix=CHECK-COMPLEX			; RUN: opt -mtriple=thumbv7em %s -S -loop-reduce -lsr-complexity-limit=2147483647 -o - | FileCheck %s

	; CHECK-DEFAULT-LABEL: for.body12.us.us:			; CHECK-LABEL: for.body12.us.us:
	; CHECK-DEFAULT: phi i32			; CHECK: [[LSR_IV6:%[^ ]+]] = phi i16* [ [[SCEVGEP7:%[^ ]+]], %for.body12.us.us ], [ [[SCEVGEP5:%[^ ]+]], %for.cond9.preheader.us.us ]
	; CHECK-DEFAULT: [[LSR_IV:%[^ ]+]] = phi i32 [ [[LSR_IV_NEXT:%[^ ]+]], %for.body12.us.us ], [ 0, %for.cond9.preheader.us.us ]			; CHECK: phi i32
	; CHECK-DEFAULT: phi i32			; CHECK: [[LSR_IV:%[^ ]+]] = phi i16* [ [[SCEVGEP1:%[^ ]+]], %for.body12.us.us ], [ [[SCEVGEP:%[^ ]+]], %for.cond9.preheader.us.us ]
	; CHECK-DEFAULT: [[LSR_IV_NEXT]] = add i32 [[LSR_IV]], 8			; CHECK: phi i32
				; CHECK: [[SCEVGEP1]] = getelementptr i16, i16* [[LSR_IV]], i32 4
	; CHECK-COMPLEX-LABEL: for.body12.us.us:			; CHECK: [[SCEVGEP7]] = getelementptr i16, i16* [[LSR_IV6]], i32 4
	; CHECK-COMPLEX: phi i32
	; CHECK-COMPLEX: [[LSR_IV6:%[^ ]+]] = phi i16* [ [[SCEVGEP7:%[^ ]+]], %for.body12.us.us ], [ [[SCEVGEP5:%[^ ]+]], %for.cond9.preheader.us.us ]
	; CHECK-COMPLEX: [[LSR_IV:%[^ ]+]] = phi i16* [ [[SCEVGEP1:%[^ ]+]], %for.body12.us.us ], [ [[SCEVGEP:%[^ ]+]], %for.cond9.preheader.us.us ]
	; CHECK-COMPLEX: phi i32
	; CHECK-COMPLEX: [[SCEVGEP1]] = getelementptr i16, i16* [[LSR_IV]], i32 4
	; CHECK-COMPLEX: [[SCEVGEP7]] = getelementptr i16, i16* [[LSR_IV6]], i32 4

	define void @convolve(i16* nocapture readonly %input_image, i16* nocapture readonly %filter, i32 %filter_dim, i32 %out_width, i32 %out_height, i32** nocapture readonly %convolved) {			define void @convolve(i16* nocapture readonly %input_image, i16* nocapture readonly %filter, i32 %filter_dim, i32 %out_width, i32 %out_height, i32** nocapture readonly %convolved) {
	entry:			entry:
	%cmp92 = icmp eq i32 %out_height, 0			%cmp92 = icmp eq i32 %out_height, 0
	br i1 %cmp92, label %for.cond.cleanup, label %for.cond1.preheader.lr.ph			br i1 %cmp92, label %for.cond.cleanup, label %for.cond1.preheader.lr.ph

	for.cond1.preheader.lr.ph: ; preds = %entry			for.cond1.preheader.lr.ph: ; preds = %entry
	%xtraiter = and i32 %filter_dim, 3			%xtraiter = and i32 %filter_dim, 3
Revision Contents

Diff 185750

llvm/trunk/include/llvm/Analysis/TargetTransformInfo.h

llvm/trunk/include/llvm/Analysis/TargetTransformInfoImpl.h

llvm/trunk/lib/Analysis/TargetTransformInfo.cpp

llvm/trunk/lib/Target/ARM/ARMTargetTransformInfo.h

llvm/trunk/lib/Transforms/Scalar/LoopStrengthReduce.cpp

llvm/trunk/test/CodeGen/ARM/dsp-loop-indexing.ll

llvm/trunk/test/CodeGen/ARM/loop-align-cortex-m.ll

llvm/trunk/test/CodeGen/ARM/loop-indexing.ll

llvm/trunk/test/Transforms/LoopStrengthReduce/ARM/complexity.ll
