This is an archive of the discontinued LLVM Phabricator instance.


Differential D55373

[LSR] Generate formulae to enable more indexed accesses
Closed, Public

Authored by samparker on Dec 6 2018, 8:08 AM.


Details

Reviewers

qcolombet
gilr
kparzysz

Commits

rG67756c09f21a: [LSR] Generate cross iteration indexes
rL353403: [LSR] Generate cross iteration indexes

Summary

Modify GenerateConstantOffsetsImpl to create offsets that can be used by indexed addressing modes. If formulae can be generated which result in the constant offset being the same size as the recurrence, we can generate an indexed access.

In the resulting code, at least for Arm, a pre-indexed load is usually used for the last access, but sometimes for the first.
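To make the addressing pattern in the summary concrete, here is a minimal, self-contained sketch. It is not code from the patch: addresses are modelled as integer indices into a vector, the loop is assumed to be unrolled by two (so the recurrence step is 2 elements), and both function names are invented for illustration. The second function starts its base one step early, so that the first access has an offset equal to the step, which is the shape that a pre-indexed access with write-back can implement.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model: two elements are consumed per iteration, so the
// address recurrence is {G,+,Step} with Step == 2.
constexpr std::ptrdiff_t Step = 2;

// Baseline: accesses at G+0 and G+1, then a separate add to advance G.
long sumSeparateIncrement(const std::vector<int> &Data) {
  long Sum = 0;
  const std::ptrdiff_t End = static_cast<std::ptrdiff_t>(Data.size());
  std::ptrdiff_t G = 0;
  while (G + Step <= End) {
    Sum += Data[static_cast<std::size_t>(G)];
    Sum += Data[static_cast<std::size_t>(G + 1)];
    G += Step; // explicit pointer update on every iteration
  }
  return Sum;
}

// Rewritten: the base starts at G - Step (set up once before the loop) and
// the first access uses an offset equal to the step. The "Base += Step"
// line stands in for the write-back of a pre-indexed access.
long sumPreIndexed(const std::vector<int> &Data) {
  long Sum = 0;
  const std::ptrdiff_t End = static_cast<std::ptrdiff_t>(Data.size());
  std::ptrdiff_t Base = -Step;
  while (Base + 2 * Step <= End) {
    Base += Step;                                    // access + write-back
    Sum += Data[static_cast<std::size_t>(Base)];     // address Base_old + Step
    Sum += Data[static_cast<std::size_t>(Base + 1)]; // reuses the new base
  }
  return Sum;
}
```

Both functions visit the same elements; the only difference is where the pointer update happens, which is what lets a target with indexed addressing fold it into a memory access.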

@kparzysz Would you be able to provide feedback on how this affects Hexagon? It's a target that I don't build and I haven't looked at the tests, but I'm assuming this would interest you.


Event Timeline

samparker created this revision. Dec 6 2018, 8:08 AM

Herald added subscribers: kristof.beyls, javed.absar. Dec 6 2018, 8:08 AM

gilr added inline comments. Dec 9 2018, 8:49 AM

test/CodeGen/ARM/loop-align-cortex-m.ll
3	You're just fixing this by the way, right?
test/Transforms/LoopStrengthReduce/ARM/complexity.ll
10–22	Shouldn't -DEFAULT be removed here too?

dmgreen added a subscriber: dmgreen. Dec 10 2018, 3:26 AM

samparker marked 2 inline comments as done. Dec 11 2018, 12:51 AM

samparker added inline comments.

test/CodeGen/ARM/loop-align-cortex-m.ll
3	yes
test/Transforms/LoopStrengthReduce/ARM/complexity.ll
10–22	thanks!

I've run some tests and the results are not great for us. On some tests we got up to 5.5% improvements, but there are a lot of severe degradations (15+% worse). If this patch goes in, we'd like to be able to opt out.

Okay, thanks. We're also seeing some regressions, so I know I've got some tuning to do. Do you have any idea of the characteristics of your regressions? At the moment I'm thinking:

- The costs that I've added here are overly simplistic; for one, I think I need to add a setup cost.
- It's also probably not worth doing when we know that the loop iteration count is low.

In the current state, we also see code size regressions, whereas your previous work helps us reduce code size. It may mean that I'll need a different flag to enable this change, but it may also be a symptom of the performance regressions.

gilr added inline comments. Dec 12 2018, 12:13 AM

lib/Transforms/Scalar/LoopStrengthReduce.cpp
3793	Is this relevant for non-Address kind formulae?
3793	Worth a comment here to describe the motivation and how adjusting the offset generates the post-inc opportunity.
3795	How does a non-constant step (e.g. 50 + %x) translate to post-inc? Could you add such a test case?

samparker marked 2 inline comments as done. Dec 12 2018, 1:09 AM

samparker added inline comments.

lib/Transforms/Scalar/LoopStrengthReduce.cpp
3793	No. I had naively assumed that this function was only generating formulae for addresses... so if that's not the case, I'll make the change here.
3795	Ah, good catch. This should only trigger for constant steps.
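The two restrictions agreed on above (only Address-kind uses, and only constant steps) can be sketched as a small guard. This is an illustrative model rather than the LSR code: `UseKind`, `StepInfo`, `Candidate` and `makeIndexedCandidate` are hypothetical names, and a non-constant step such as `50 + %x` is represented simply by `IsConstant == false`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins; the names only loosely mirror LSR concepts.
enum class UseKind { Basic, Address, ICmpZero };

struct StepInfo {
  bool IsConstant; // false for a step such as (50 + %x)
  int64_t Value;   // only meaningful when IsConstant is true
};

struct Candidate {
  bool Valid;
  int64_t RegAdjust; // how far the base register is moved
  int64_t Immediate; // immediate offset left on the access
};

// Move the base register back by one step so that the immediate left on
// the access grows by the step. For an access with immediate 0 this
// yields an immediate equal to the step, the indexed-access shape.
Candidate makeIndexedCandidate(UseKind Kind, StepInfo Step,
                               int64_t Immediate) {
  if (Kind != UseKind::Address) // not relevant for non-address uses
    return Candidate{false, 0, 0};
  if (!Step.IsConstant)         // only trigger for constant steps
    return Candidate{false, 0, 0};
  return Candidate{true, -Step.Value, Immediate + Step.Value};
}
```

For example, with a step of 8 and an access at immediate 0, the sketch proposes a base of G - 8 and an immediate of 8; with a non-constant step it proposes nothing.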

I've moved the logic under the control of a new TTI flag as it seems that the current shouldFavorPostInc is trying to achieve different things. Hopefully I've also addressed Gil's comments.

gilr added inline comments. Dec 16 2018, 5:56 AM

lib/Transforms/Scalar/LoopStrengthReduce.cpp
1258	The existing `{LI, +, C}` pattern seems to already match this case (i.e. `{(-C + %a + ...), +, C}`) and the more general case `{(Offset - C2) + %a + ...), +, C2}`, where Offset =/= 0. So IIUC matching this pattern here is only needed if the new TTI flag is set but the existing one is reset, right? Won't this also match something like `{3, +, 5, +, %x}`?
1384	IIRC single-statement clauses shouldn't get curly braces.
3802	The new TTI API relates to the same HW feature, right? Why not use a cl::opt in LSR to turn this optimization off?

samparker marked an inline comment as done. Dec 19 2018, 1:27 AM

samparker added inline comments.

lib/Transforms/Scalar/LoopStrengthReduce.cpp
1258	I'm going to need to replace this with something that compares the step with the base offset, aiming to have the post increment happen on the last, not the first, access. I need to make a few other changes to support this though as well.

samparker marked an inline comment as done. Dec 19 2018, 4:49 AM

samparker added inline comments.

lib/Transforms/Scalar/LoopStrengthReduce.cpp
3802	When I've got the next patch ready, I will re-run my tests and see if this is viable. It may not be good for all targets, and if so, I think a target hook would be more useful.

Added a command line option to 'EnableBackedgePostIncs'.
Added a command line option to enable narrowing the search space by collapsing unrolled code.
The TTI hook now accepts the loop so that the target can make a more informed decision on when it 'shouldFavorBackedgePostIncs'.
LSRInstance contains a boolean 'FavorBackedgePostInc' which is equal to EnableBackedgePostIncs && shouldFavorBackedgePostIncs.
When FavorBackedgePostIncs:
- Generate the new constant offsets.
- In RateRegister, the LoopCost is now set to 0 if the step recurrence is equal to the base offset of the parent formula.
- IsProfitableChain has a higher limit
- The last expression in the IVChain is not added to IVIncSet so it's a target for optimising.
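As a rough illustration of the first two points in this update, the sketch below models how the combined flag and the RateRegister cost change fit together. It is a simplified stand-in, not the patch itself: `FormulaModel`, `AddRecModel`, `favorBackedgePostInc` and `loopCost` are hypothetical, and the real code operates on SCEV expressions rather than plain integers.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified stand-ins for an LSR formula and an add
// recurrence.
struct FormulaModel {
  int64_t BaseOffset;
};

struct AddRecModel {
  bool HasConstStep; // whether the step recurrence is a constant
  int64_t Step;      // only meaningful when HasConstStep is true
};

// The optimisation is active only if both the command line option and the
// target hook agree.
bool favorBackedgePostInc(bool EnableOption, bool TargetHook) {
  return EnableOption && TargetHook;
}

// An add recurrence normally costs 1. When the feature is active and the
// constant step equals the formula's base offset, the update can be folded
// into an indexed access, so the cost drops to 0.
unsigned loopCost(const FormulaModel &F, const AddRecModel &AR,
                  bool FavorBackedgePostInc) {
  unsigned LoopCost = 1;
  if (FavorBackedgePostInc && AR.HasConstStep && AR.Step == F.BaseOffset)
    LoopCost = 0;
  return LoopCost;
}
```

A formula with base offset 8 over a recurrence with step 8 is rated as free in this model, whereas a base offset of 4 over the same recurrence keeps the default cost of 1.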

samparker retitled this revision from [LSR] Generate formulae to enable more post-incs to [LSR] Generate formulae to enable more indexed accesses. Jan 11 2019, 7:22 AM

samparker edited the summary of this revision.

Herald added a subscriber: arphaman. Jan 11 2019, 7:22 AM

Some renaming: renamed the post-inc bits to 'index'.

Changed the default value for the command line option 'EnableBackedgeIndexing' to false.

Removed changes to the complexity.ll test.

samparker added a child revision: D56719: [DAGCombine] Enable more pre-indexed stores. Jan 15 2019, 6:57 AM

samparker removed a child revision: D56719: [DAGCombine] Enable more pre-indexed stores. Jan 16 2019, 6:30 AM

ping

In D55373#1367469, @samparker wrote:

ping

Are you still seeing regressions? Can you provide performance data with the current patch?

Hi Hal,

I've attached a graph to show geomean performance results of a popular embedded benchmark suite running on an Arm microcontroller. The suite is broken into five sub-suites, only three of which are affected by these changes. The large regression comes from a test which often exhibits bi-modal behaviour, so I could be getting unlucky. The whole suite comprises several dozen tests, and only four of the benchmarks have minor regressions across the three runs. There's also some backend work that I need to do for Arm to get this optimised properly.

gilr added inline comments. Jan 28 2019, 5:52 AM

lib/Transforms/Scalar/LoopStrengthReduce.cpp
159	False by default?
4466–4467	The convention for flags controlling the narrowing heuristics seems to be to use them in NarrowSearchSpaceUsingHeuristics() rather than in the functions they affect.
4466–4467	I think you meant '||' here.

samparker marked an inline comment as done. Jan 29 2019, 1:20 AM

samparker added inline comments.

lib/Transforms/Scalar/LoopStrengthReduce.cpp
159	This option has been introduced to force the collapsing, even if the complexity limit hasn't been reached. Which is why it's implemented differently to the other, similar, options.

gilr added inline comments. Feb 3 2019, 2:27 PM

lib/Transforms/Scalar/LoopStrengthReduce.cpp
159	Aaargh, sorry, read it backwards ... Is it needed since unrolled code doesn't initially go into the same LSRUse? If so, can FavorBackedgeIndex be used to have initial construction put them in the same LSRUse?
2904	Deserves a comment.
3121	I'm a bit confused here. IIUC a profitable complete chain provides an efficient solution using post-increments. Why break the last increment out?

samparker marked 2 inline comments as done. Feb 4 2019, 3:36 AM

samparker added inline comments.

lib/Transforms/Scalar/LoopStrengthReduce.cpp
159	I will look into this, thanks.
3121	This was so that the last access could also be a target for optimisation. I'll remove the change and do some testing again to see the difference.

Made some simplifications:

- Reset isProfitableChain.
- Reset FinalizeChain so that the tail is added to the chain again.
- Removed the CollapseUnrolled option because, with the resets above, it wasn't really interesting.

I'll post the updated performance numbers.

gilr added inline comments. Feb 6 2019, 12:40 PM

include/llvm/Analysis/TargetTransformInfo.h
490	Please add a doxygen comment.
lib/Target/ARM/ARMTargetTransformInfo.h
98	Is this optimization inherently code-size unfriendly for ARM? (The patch actually reduces the instruction count in LSR's Cost when this optimization kicks in)
100	Is the single-block constraint due to CodeGen's single-block optimization scope? (If so, then IINM it's not target-specific)

samparker marked 2 inline comments as done. Feb 7 2019, 1:35 AM

samparker added inline comments.

lib/Target/ARM/ARMTargetTransformInfo.h
98	There's two reasons really: the transform is most useful in 'unrolled' loops (which we disable when optimising for code size), and this transform will introduce instructions into the preheader and if the address can't be kept in the same register, we'll also produce moves. So this is mainly a defensive restriction, because I haven't been tracking code size, but I would hope that I can remove the restriction later.
100	No, it's not because of ISel restrictions or anything like that. It's because the transform is only likely to be useful if the address can be kept in the same register - which becomes increasingly less likely once multiple blocks are considered.
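The Arm heuristic being discussed can be restated as a small predicate. The sketch below is a simplified model under stated assumptions: `SubtargetModel` and `LoopModel` are hypothetical structs standing in for the subtarget and loop queries, and the function only mirrors the conditions described in this thread (no code-size optimisation, M-class Thumb2, single-block loop).

```cpp
#include <cassert>

// Hypothetical stand-ins for the subtarget and loop properties consulted
// by the target hook.
struct SubtargetModel {
  bool IsMClass;
  bool IsThumb2;
};

struct LoopModel {
  bool OptForSize;    // enclosing function is optimised for size
  unsigned NumBlocks; // number of basic blocks in the loop
};

bool shouldFavorBackedgeIndexModel(const SubtargetModel &ST,
                                   const LoopModel &L) {
  // Defensive: the transform adds preheader instructions and possibly
  // moves, so it is skipped when optimising for code size.
  if (L.OptForSize)
    return false;
  // Keeping the address in one register is most likely in a single-block
  // loop; the hook is also limited to M-class Thumb2 cores.
  return ST.IsMClass && ST.IsThumb2 && L.NumBlocks == 1;
}
```

A more refined version might consider register pressure instead of the block count, which is left out here.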

Added doxygen comment.

gilr added inline comments. Feb 7 2019, 2:26 AM

lib/Target/ARM/ARMTargetTransformInfo.h
100	So both the code-size and single-block heuristics seem target-independent. Why not do this in LSR?

samparker marked an inline comment as done. Feb 7 2019, 2:58 AM

samparker added inline comments.

lib/Target/ARM/ARMTargetTransformInfo.h
100	I'd argue it's because different backends would come to different conclusions from mine. I've gone for a very simplistic heuristic, but it would be good to consider register pressure rather than just the number of blocks. Also, some targets may not support indexed accesses on certain types of memory operations and so it is worthless generating formulae for them.

LGTM!
(One last nitpick: I'd consider letting EnableBackedgeIndexing default to true as TTI's default already disables it for Hexagon and all other targets)

This revision is now accepted and ready to land. Feb 7 2019, 4:28 AM

Will do. Thanks for the review!

Closed by commit rL353403: [LSR] Generate cross iteration indexes (authored by sam_parker). Feb 7 2019, 5:33 AM

This revision was automatically updated to reflect the committed changes.

Herald added a project: Restricted Project. Feb 7 2019, 5:33 AM

Revision Contents

Path                                                    Size
include/llvm/Analysis/TargetTransformInfo.h             6 lines
include/llvm/Analysis/TargetTransformInfoImpl.h         2 lines
lib/Analysis/TargetTransformInfo.cpp                    4 lines
lib/Target/ARM/ARMTargetTransformInfo.h                 6 lines
lib/Transforms/Scalar/LoopStrengthReduce.cpp            114 lines
test/CodeGen/ARM/dsp-loop-indexing.ll                   324 lines
test/CodeGen/ARM/loop-align-cortex-m.ll                 4 lines
test/CodeGen/ARM/loop-indexing.ll                       1236 lines
test/Transforms/LoopStrengthReduce/ARM/complexity.ll    32 lines

Diff 181270

include/llvm/Analysis/TargetTransformInfo.h

@@ ... @@ public:
   /// Loop-strength-reduction (LSR) uses that knowledge to adjust its cost
   /// calculation for the instructions in a loop.
   bool canMacroFuseCmp() const;

   /// \return True is LSR should make efforts to create/preserve post-inc
   /// addressing mode expressions.
   bool shouldFavorPostInc() const;

+  bool shouldFavorBackedgeIndex(const Loop *L) const;
+
   /// Return true if the target supports masked load/store
   /// AVX2 and AVX-512 targets allow masks for consecutive load and store
   bool isLegalMaskedStore(Type *DataType) const;
   bool isLegalMaskedLoad(Type *DataType) const;

   /// Return true if the target supports masked gather/scatter
   /// AVX-512 fully supports gather and scatter for vectors with 32 and 64
   /// bits scalar type.
@@ ... @@ virtual bool isLegalAddressingMode(Type *Ty, GlobalValue *BaseGV,
                                      int64_t BaseOffset, bool HasBaseReg,
                                      int64_t Scale,
                                      unsigned AddrSpace,
                                      Instruction *I) = 0;
   virtual bool isLSRCostLess(TargetTransformInfo::LSRCost &C1,
                              TargetTransformInfo::LSRCost &C2) = 0;
   virtual bool canMacroFuseCmp() = 0;
   virtual bool shouldFavorPostInc() const = 0;
+  virtual bool shouldFavorBackedgeIndex(const Loop *L) const = 0;
   virtual bool isLegalMaskedStore(Type *DataType) = 0;
   virtual bool isLegalMaskedLoad(Type *DataType) = 0;
   virtual bool isLegalMaskedScatter(Type *DataType) = 0;
   virtual bool isLegalMaskedGather(Type *DataType) = 0;
   virtual bool hasDivRemOp(Type *DataType, bool IsSigned) = 0;
   virtual bool hasVolatileVariant(Instruction *I, unsigned AddrSpace) = 0;
   virtual bool prefersVectorizedAddressing() = 0;
   virtual int getScalingFactorCost(Type *Ty, GlobalValue *BaseGV,
@@ ... @@ bool isLSRCostLess(TargetTransformInfo::LSRCost &C1,
     return Impl.isLSRCostLess(C1, C2);
   }
   bool canMacroFuseCmp() override {
     return Impl.canMacroFuseCmp();
   }
   bool shouldFavorPostInc() const override {
     return Impl.shouldFavorPostInc();
   }
+  bool shouldFavorBackedgeIndex(const Loop *L) const override {
+    return Impl.shouldFavorBackedgeIndex(L);
+  }
   bool isLegalMaskedStore(Type *DataType) override {
     return Impl.isLegalMaskedStore(DataType);
   }
   bool isLegalMaskedLoad(Type *DataType) override {
     return Impl.isLegalMaskedLoad(DataType);
   }
   bool isLegalMaskedScatter(Type *DataType) override {
     return Impl.isLegalMaskedScatter(DataType);

include/llvm/Analysis/TargetTransformInfoImpl.h

@@ ... @@ return std::tie(C1.NumRegs, C1.AddRecCost, C1.NumIVMuls, C1.NumBaseAdds,
            std::tie(C2.NumRegs, C2.AddRecCost, C2.NumIVMuls, C2.NumBaseAdds,
                     C2.ScaleCost, C2.ImmCost, C2.SetupCost);
   }

   bool canMacroFuseCmp() { return false; }

   bool shouldFavorPostInc() const { return false; }

+  bool shouldFavorBackedgeIndex(const Loop *L) const { return false; }
+
   bool isLegalMaskedStore(Type *DataType) { return false; }

   bool isLegalMaskedLoad(Type *DataType) { return false; }

   bool isLegalMaskedScatter(Type *DataType) { return false; }

   bool isLegalMaskedGather(Type *DataType) { return false; }

lib/Analysis/TargetTransformInfo.cpp

@@ ... @@
 bool TargetTransformInfo::canMacroFuseCmp() const {
   return TTIImpl->canMacroFuseCmp();
 }

 bool TargetTransformInfo::shouldFavorPostInc() const {
   return TTIImpl->shouldFavorPostInc();
 }

+bool TargetTransformInfo::shouldFavorBackedgeIndex(const Loop *L) const {
+  return TTIImpl->shouldFavorBackedgeIndex(L);
+}
+
 bool TargetTransformInfo::isLegalMaskedStore(Type *DataType) const {
   return TTIImpl->isLegalMaskedStore(DataType);
 }

 bool TargetTransformInfo::isLegalMaskedLoad(Type *DataType) const {
   return TTIImpl->isLegalMaskedLoad(DataType);
 }
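The three TargetTransformInfo hunks above thread the new hook through the same layers as the existing hooks: a public forwarding method, a pure virtual method on an internal interface, an override that forwards to the concrete implementation, and a conservative default that returns false. The following is a stripped-down, self-contained sketch of that layering. It is not the LLVM class: `Loop`, `TTI`, `DefaultImpl` and `OptInTargetImpl` are placeholder types invented for illustration.

```cpp
#include <cassert>
#include <memory>
#include <utility>

struct Loop {}; // placeholder for the real loop type

class TTI {
  // Internal interface: one pure virtual method per hook.
  struct Concept {
    virtual ~Concept() = default;
    virtual bool shouldFavorBackedgeIndex(const Loop *L) const = 0;
  };

  // Wraps any implementation type and forwards each hook to it.
  template <typename T> struct Model final : Concept {
    T Impl;
    explicit Model(T I) : Impl(std::move(I)) {}
    bool shouldFavorBackedgeIndex(const Loop *L) const override {
      return Impl.shouldFavorBackedgeIndex(L);
    }
  };

  std::unique_ptr<Concept> TTIImpl;

public:
  template <typename T>
  explicit TTI(T Impl)
      : TTIImpl(std::make_unique<Model<T>>(std::move(Impl))) {}

  // Public entry point used by passes such as LSR.
  bool shouldFavorBackedgeIndex(const Loop *L) const {
    return TTIImpl->shouldFavorBackedgeIndex(L);
  }
};

// Conservative default: targets that do not override the hook opt out.
struct DefaultImpl {
  bool shouldFavorBackedgeIndex(const Loop *) const { return false; }
};

// A hypothetical target that opts in unconditionally.
struct OptInTargetImpl {
  bool shouldFavorBackedgeIndex(const Loop *) const { return true; }
};
```

With this shape, a target only has to provide its own `shouldFavorBackedgeIndex`; every other target keeps the default answer without any change.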

lib/Target/ARM/ARMTargetTransformInfo.h

@@ ... @@ explicit ARMTTIImpl(const ARMBaseTargetMachine *TM, const Function &F)
       : BaseT(TM, F.getParent()->getDataLayout()), ST(TM->getSubtargetImpl(F)),
         TLI(ST->getTargetLowering()) {}

   bool areInlineCompatible(const Function *Caller,
                            const Function *Callee) const;

   bool enableInterleavedAccessVectorization() { return true; }

+  bool shouldFavorBackedgeIndex(const Loop *L) const {
+    if (L->getHeader()->getParent()->optForSize())
+      return false;
+    return ST->isMClass() && ST->isThumb2() && L->getNumBlocks() == 1;
+  }
+
   /// Floating-point computation using ARMv8 AArch32 Advanced
   /// SIMD instructions remains unchanged from ARMv7. Only AArch64 SIMD
   /// is IEEE-754 compliant, but it's not covered in this target.
   bool isFPVectorizationPotentiallyUnsafe() {
     return !ST->isTargetDarwin();
   }

   /// \name Scalar TTI Implementations

lib/Transforms/Scalar/LoopStrengthReduce.cpp

@@ ... @@

 // Flag to narrow search space by filtering non-optimal formulae with
 // the same ScaledReg and Scale.
 static cl::opt<bool> FilterSameScaledReg(
     "lsr-filter-same-scaled-reg", cl::Hidden, cl::init(true),
     cl::desc("Narrow LSR search space by filtering non-optimal formulae"
              " with the same ScaledReg and Scale"));

+static cl::opt<bool> CollapseUnrolledCode(
+    "lsr-collapse-unrolled", cl::Hidden, cl::init(false),
+    cl::desc("Narrow LSR search space by collapsing unrolled code"));
+
+static cl::opt<bool> EnableBackedgeIndexing(
+    "lsr-backedge-indexing", cl::Hidden, cl::init(true),
+    cl::desc("Enable the generation of cross iteration post increments"));
+
 static cl::opt<unsigned> ComplexityLimit(
     "lsr-complexity-limit", cl::Hidden,
     cl::init(std::numeric_limits<uint16_t>::max()),
     cl::desc("LSR search space complexity limit"));

 #ifndef NDEBUG
 // Stress test IV chain generation.
 static cl::opt<bool> StressIVChain(
@@ ... @@ void RateFormula(const TargetTransformInfo &TTI,
                    ScalarEvolution &SE, DominatorTree &DT,
                    const LSRUse &LU,
                    SmallPtrSetImpl<const SCEV *> *LoserRegs = nullptr);

   void print(raw_ostream &OS) const;
   void dump() const;

 private:
-  void RateRegister(const SCEV *Reg,
+  void RateRegister(const Formula &F, const SCEV *Reg,
                     SmallPtrSetImpl<const SCEV *> &Regs,
                     const Loop *L,
                     ScalarEvolution &SE, DominatorTree &DT,
                     const TargetTransformInfo &TTI);
-  void RatePrimaryRegister(const SCEV *Reg,
+  void RatePrimaryRegister(const Formula &F, const SCEV *Reg,
                            SmallPtrSetImpl<const SCEV *> &Regs,
                            const Loop *L,
                            ScalarEvolution &SE, DominatorTree &DT,
                            SmallPtrSetImpl<const SCEV *> *LoserRegs,
                            const TargetTransformInfo &TTI);
 };

 /// An operand value in an instruction which is to be replaced with some
@@ ... @@

 static bool isAMCompletelyFolded(const TargetTransformInfo &TTI,
                                  LSRUse::KindType Kind, MemAccessTy AccessTy,
                                  GlobalValue *BaseGV, int64_t BaseOffset,
                                  bool HasBaseReg, int64_t Scale,
                                  Instruction *Fixup = nullptr);

 /// Tally up interesting quantities from the given register.
-void Cost::RateRegister(const SCEV *Reg,
+void Cost::RateRegister(const Formula &F, const SCEV *Reg,
                         SmallPtrSetImpl<const SCEV *> &Regs,
                         const Loop *L,
                         ScalarEvolution &SE, DominatorTree &DT,
                         const TargetTransformInfo &TTI) {
   if (const SCEVAddRecExpr *AR = dyn_cast<SCEVAddRecExpr>(Reg)) {
     // If this is an addrec for another loop, it should be an invariant
     // with respect to L since L is the innermost loop (at least
     // for now LSR only handles innermost loops).
@@ ... @@ if (AR->getLoop() != L) {
       }

       // Otherwise, it will be an invariant with respect to Loop L.
       ++C.NumRegs;
       return;
     }

     unsigned LoopCost = 1;
+    if (TTI.isIndexedLoadLegal(TTI.MIM_PostInc, AR->getType()) ||
+        TTI.isIndexedStoreLegal(TTI.MIM_PostInc, AR->getType())) {
+
+      // If the step size matches the base offset, we could use post increment
+      // addressing so that the instruction then updates the pointer for its
+      // own use in the next iteration.
+      if (TTI.shouldFavorBackedgeIndex(L)) {
+        if (auto *Step = dyn_cast<SCEVConstant>(AR->getStepRecurrence(SE))) {
+          if (Step->getAPInt() == F.BaseOffset)
+            LoopCost = 0;
+        }
+      }
+
       if (TTI.shouldFavorPostInc()) {
         const SCEV *LoopStep = AR->getStepRecurrence(SE);
         if (isa<SCEVConstant>(LoopStep)) {
-          // Check if a post-indexed load/store can be used.
-          if (TTI.isIndexedLoadLegal(TTI.MIM_PostInc, AR->getType()) ||
-              TTI.isIndexedStoreLegal(TTI.MIM_PostInc, AR->getType())) {
           const SCEV *LoopStart = AR->getStart();
           if (!isa<SCEVConstant>(LoopStart) &&
               SE.isLoopInvariant(LoopStart, L))
             LoopCost = 0;
         }
       }
     }
     C.AddRecCost += LoopCost;

     // Add the step value register, if it needs one.
     // TODO: The non-affine case isn't precisely modeled here.
     if (!AR->isAffine() || !isa<SCEVConstant>(AR->getOperand(1))) {
       if (!Regs.count(AR->getOperand(1))) {
-        RateRegister(AR->getOperand(1), Regs, L, SE, DT, TTI);
+        RateRegister(F, AR->getOperand(1), Regs, L, SE, DT, TTI);
         if (isLoser())
           return;
       }
     }
   }
   ++C.NumRegs;

   // Rough heuristic; favor registers which don't require extra setup
   // instructions in the preheader.
   if (!isa<SCEVUnknown>(Reg) &&
       !isa<SCEVConstant>(Reg) &&
       !(isa<SCEVAddRecExpr>(Reg) &&
         (isa<SCEVUnknown>(cast<SCEVAddRecExpr>(Reg)->getStart()) ||
          isa<SCEVConstant>(cast<SCEVAddRecExpr>(Reg)->getStart()))))
     ++C.SetupCost;

   C.NumIVMuls += isa<SCEVMulExpr>(Reg) &&
                  SE.hasComputableLoopEvolution(Reg, L);
 }

/// Record this register in the set. If we haven't seen it before, rate		/// Record this register in the set. If we haven't seen it before, rate
/// it. Optional LoserRegs provides a way to declare any formula that refers to		/// it. Optional LoserRegs provides a way to declare any formula that refers to
/// one of those regs an instant loser.		/// one of those regs an instant loser.
void Cost::RatePrimaryRegister(const SCEV *Reg,		void Cost::RatePrimaryRegister(const Formula &F, const SCEV *Reg,
SmallPtrSetImpl<const SCEV *> &Regs,		SmallPtrSetImpl<const SCEV *> &Regs,
const Loop *L,		const Loop *L,
ScalarEvolution &SE, DominatorTree &DT,		ScalarEvolution &SE, DominatorTree &DT,
SmallPtrSetImpl<const SCEV *> *LoserRegs,		SmallPtrSetImpl<const SCEV *> *LoserRegs,
const TargetTransformInfo &TTI) {		const TargetTransformInfo &TTI) {
if (LoserRegs && LoserRegs->count(Reg)) {		if (LoserRegs && LoserRegs->count(Reg)) {
Lose();		Lose();
return;		return;
}		}
if (Regs.insert(Reg).second) {		if (Regs.insert(Reg).second) {
RateRegister(Reg, Regs, L, SE, DT, TTI);		RateRegister(F, Reg, Regs, L, SE, DT, TTI);
if (LoserRegs && isLoser())		if (LoserRegs && isLoser())
LoserRegs->insert(Reg);		LoserRegs->insert(Reg);
}		}
}		}

void Cost::RateFormula(const TargetTransformInfo &TTI,		void Cost::RateFormula(const TargetTransformInfo &TTI,
const Formula &F,		const Formula &F,
SmallPtrSetImpl<const SCEV *> &Regs,		SmallPtrSetImpl<const SCEV *> &Regs,
const DenseSet<const SCEV *> &VisitedRegs,		const DenseSet<const SCEV *> &VisitedRegs,
const Loop *L,		const Loop *L,
ScalarEvolution &SE, DominatorTree &DT,		ScalarEvolution &SE, DominatorTree &DT,
const LSRUse &LU,		const LSRUse &LU,
SmallPtrSetImpl<const SCEV *> *LoserRegs) {		SmallPtrSetImpl<const SCEV *> *LoserRegs) {
assert(F.isCanonical(*L) && "Cost is accurate only for canonical formula");		assert(F.isCanonical(*L) && "Cost is accurate only for canonical formula");
// Tally up the registers.		// Tally up the registers.
unsigned PrevAddRecCost = C.AddRecCost;		unsigned PrevAddRecCost = C.AddRecCost;
unsigned PrevNumRegs = C.NumRegs;		unsigned PrevNumRegs = C.NumRegs;
unsigned PrevNumBaseAdds = C.NumBaseAdds;		unsigned PrevNumBaseAdds = C.NumBaseAdds;
if (const SCEV *ScaledReg = F.ScaledReg) {		if (const SCEV *ScaledReg = F.ScaledReg) {
if (VisitedRegs.count(ScaledReg)) {		if (VisitedRegs.count(ScaledReg)) {
Lose();		Lose();
return;		return;
}		}
RatePrimaryRegister(ScaledReg, Regs, L, SE, DT, LoserRegs, TTI);		RatePrimaryRegister(F, ScaledReg, Regs, L, SE, DT, LoserRegs, TTI);
if (isLoser())		if (isLoser())
return;		return;
}		}
for (const SCEV *BaseReg : F.BaseRegs) {		for (const SCEV *BaseReg : F.BaseRegs) {
if (VisitedRegs.count(BaseReg)) {		if (VisitedRegs.count(BaseReg)) {
Lose();		Lose();
return;		return;
}		}
RatePrimaryRegister(BaseReg, Regs, L, SE, DT, LoserRegs, TTI);		RatePrimaryRegister(F, BaseReg, Regs, L, SE, DT, LoserRegs, TTI);
if (isLoser())		if (isLoser())
return;		return;
}		}

// Determine how many (unfolded) adds we'll need inside the loop.		// Determine how many (unfolded) adds we'll need inside the loop.
size_t NumBaseParts = F.getNumRegs();		size_t NumBaseParts = F.getNumRegs();
if (NumBaseParts > 1)		if (NumBaseParts > 1)
// Do not count the base and a possible second register if the target		// Do not count the base and a possible second register if the target
Show All 23 Lines	if (LU.Kind == LSRUse::Address && Offset != 0 &&
C.NumBaseAdds++;		C.NumBaseAdds++;
}		}

// If we don't count instruction cost exit here.		// If we don't count instruction cost exit here.
if (!InsnsCost) {		if (!InsnsCost) {
assert(isValid() && "invalid cost");		assert(isValid() && "invalid cost");
return;		return;
}		}

		gilr: IIRC single-statement clauses shouldn't get curly braces.
// Treat every new register that exceeds TTI.getNumberOfRegisters() - 1 as		// Treat every new register that exceeds TTI.getNumberOfRegisters() - 1 as
// additional instruction (at least fill).		// additional instruction (at least fill).
unsigned TTIRegNum = TTI.getNumberOfRegisters(false) - 1;		unsigned TTIRegNum = TTI.getNumberOfRegisters(false) - 1;
if (C.NumRegs > TTIRegNum) {		if (C.NumRegs > TTIRegNum) {
// Cost already exceeded TTIRegNum, then only newly added register can add		// Cost already exceeded TTIRegNum, then only newly added register can add
// new instructions.		// new instructions.
if (PrevNumRegs > TTIRegNum)		if (PrevNumRegs > TTIRegNum)
C.Insns += (C.NumRegs - PrevNumRegs);		C.Insns += (C.NumRegs - PrevNumRegs);
▲ Show 20 Lines • Show All 488 Lines • ▼ Show 20 Lines	struct IVChain {
bool hasIncs() const { return Incs.size() >= 2; }		bool hasIncs() const { return Incs.size() >= 2; }

// Add an IVInc to the end of this chain.		// Add an IVInc to the end of this chain.
void add(const IVInc &X) { Incs.push_back(X); }		void add(const IVInc &X) { Incs.push_back(X); }

// Returns the last UserInst in the chain.		// Returns the last UserInst in the chain.
Instruction *tailUserInst() const { return Incs.back().UserInst; }		Instruction *tailUserInst() const { return Incs.back().UserInst; }

		Instruction *head() {
		return Incs.front().UserInst;
		}

		Instruction *lastNonPHI() {
		if (!isa<PHINode>(Incs.back().UserInst))
		return tailUserInst();
		if (Incs.size() < 2)
		return head();
		return Incs[Incs.size()-2].UserInst;
		}

// Returns true if IncExpr can be profitably added to this chain.		// Returns true if IncExpr can be profitably added to this chain.
bool isProfitableIncrement(const SCEV *OperExpr,		bool isProfitableIncrement(const SCEV *OperExpr,
const SCEV *IncExpr,		const SCEV *IncExpr,
ScalarEvolution&);		ScalarEvolution&);
};		};

/// Helper for CollectChains to track multiple IV increment uses. Distinguish		/// Helper for CollectChains to track multiple IV increment uses. Distinguish
/// between FarUsers that definitely cross IV increments and NearUsers that may		/// between FarUsers that definitely cross IV increments and NearUsers that may
/// be used between IV increments.		/// be used between IV increments.
struct ChainUsers {		struct ChainUsers {
SmallPtrSet<Instruction*, 4> FarUsers;		SmallPtrSet<Instruction*, 4> FarUsers;
SmallPtrSet<Instruction*, 4> NearUsers;		SmallPtrSet<Instruction*, 4> NearUsers;
};		};

/// This class holds state for the main loop strength reduction logic.		/// This class holds state for the main loop strength reduction logic.
class LSRInstance {		class LSRInstance {
IVUsers &IU;		IVUsers &IU;
ScalarEvolution &SE;		ScalarEvolution &SE;
DominatorTree &DT;		DominatorTree &DT;
LoopInfo &LI;		LoopInfo &LI;
const TargetTransformInfo &TTI;		const TargetTransformInfo &TTI;
Loop *const L;		Loop *const L;
		bool FavorBackedgeIndex = false;
bool Changed = false;		bool Changed = false;

/// This is the insert position that the current loop's induction variable		/// This is the insert position that the current loop's induction variable
/// increment should be placed. In simple loops, this is the latch block's		/// increment should be placed. In simple loops, this is the latch block's
/// terminator. But in more complicated cases, this is a position which will		/// terminator. But in more complicated cases, this is a position which will
/// dominate all the in-loop post-increment users.		/// dominate all the in-loop post-increment users.
Instruction *IVIncInsertPos = nullptr;		Instruction *IVIncInsertPos = nullptr;

▲ Show 20 Lines • Show All 898 Lines • ▼ Show 20 Lines
///		///
/// Chaining IVs can lead to considerable code bloat if ISEL doesn't		/// Chaining IVs can lead to considerable code bloat if ISEL doesn't
/// effectively use postinc addressing modes. Only consider it profitable if the		/// effectively use postinc addressing modes. Only consider it profitable if the
/// increments can be computed in fewer registers when chained.		/// increments can be computed in fewer registers when chained.
///		///
/// TODO: Consider IVInc free if it's already used in another chain.		/// TODO: Consider IVInc free if it's already used in another chain.
static bool		static bool
isProfitableChain(IVChain &Chain, SmallPtrSetImpl<Instruction*> &Users,		isProfitableChain(IVChain &Chain, SmallPtrSetImpl<Instruction*> &Users,
ScalarEvolution &SE, const TargetTransformInfo &TTI) {		ScalarEvolution &SE, bool FavorBackedgeIndex) {
if (StressIVChain)		if (StressIVChain)
return true;		return true;

if (!Chain.hasIncs())		if (!Chain.hasIncs())
return false;		return false;

if (!Users.empty()) {		if (!Users.empty()) {
LLVM_DEBUG(dbgs() << "Chain: " << *Chain.Incs[0].UserInst << " users:\n";		LLVM_DEBUG(dbgs() << "Chain: " << *Chain.Incs[0].UserInst << " users:\n";
▲ Show 20 Lines • Show All 49 Lines • ▼ Show 20 Lines	isProfitableChain(IVChain &Chain, SmallPtrSetImpl<Instruction*> &Users,

// Reusing variable increments likely saves a register to hold the multiple of		// Reusing variable increments likely saves a register to hold the multiple of
// the stride.		// the stride.
cost -= NumReusedIncrements;		cost -= NumReusedIncrements;

LLVM_DEBUG(dbgs() << "Chain: " << *Chain.Incs[0].UserInst << " Cost: " << cost		LLVM_DEBUG(dbgs() << "Chain: " << *Chain.Incs[0].UserInst << " Cost: " << cost
<< "\n");		<< "\n");

		if (FavorBackedgeIndex)
		gilr: Deserves a comment.
		return cost <= 1;

return cost < 0;		return cost < 0;
}		}

/// Add this IV user to an existing chain or make it the head of a new chain.		/// Add this IV user to an existing chain or make it the head of a new chain.
void LSRInstance::ChainInstruction(Instruction *UserInst, Instruction *IVOper,		void LSRInstance::ChainInstruction(Instruction *UserInst, Instruction *IVOper,
SmallVectorImpl<ChainUsers> &ChainUsersVec) {		SmallVectorImpl<ChainUsers> &ChainUsersVec) {
// When IVs are used as types of varying widths, they are generally converted		// When IVs are used as types of varying widths, they are generally converted
// to a wider type with some uses remaining narrow under a (free) trunc.		// to a wider type with some uses remaining narrow under a (free) trunc.
▲ Show 20 Lines • Show All 178 Lines • ▼ Show 20 Lines	for (PHINode &PN : L->getHeader()->phis()) {
if (IncV)		if (IncV)
ChainInstruction(&PN, IncV, ChainUsersVec);		ChainInstruction(&PN, IncV, ChainUsersVec);
}		}
// Remove any unprofitable chains.		// Remove any unprofitable chains.
unsigned ChainIdx = 0;		unsigned ChainIdx = 0;
for (unsigned UsersIdx = 0, NChains = IVChainVec.size();		for (unsigned UsersIdx = 0, NChains = IVChainVec.size();
UsersIdx < NChains; ++UsersIdx) {		UsersIdx < NChains; ++UsersIdx) {
if (!isProfitableChain(IVChainVec[UsersIdx],		if (!isProfitableChain(IVChainVec[UsersIdx],
ChainUsersVec[UsersIdx].FarUsers, SE, TTI))		ChainUsersVec[UsersIdx].FarUsers, SE,
		FavorBackedgeIndex))
continue;		continue;
// Preserve the chain at UsersIdx.		// Preserve the chain at UsersIdx.
if (ChainIdx != UsersIdx)		if (ChainIdx != UsersIdx)
IVChainVec[ChainIdx] = IVChainVec[UsersIdx];		IVChainVec[ChainIdx] = IVChainVec[UsersIdx];
FinalizeChain(IVChainVec[ChainIdx]);		FinalizeChain(IVChainVec[ChainIdx]);
++ChainIdx;		++ChainIdx;
}		}
IVChainVec.resize(ChainIdx);		IVChainVec.resize(ChainIdx);
}		}

void LSRInstance::FinalizeChain(IVChain &Chain) {		void LSRInstance::FinalizeChain(IVChain &Chain) {
assert(!Chain.Incs.empty() && "empty IV chains are not allowed");		assert(!Chain.Incs.empty() && "empty IV chains are not allowed");
LLVM_DEBUG(dbgs() << "Final Chain: " << *Chain.Incs[0].UserInst << "\n");		LLVM_DEBUG(dbgs() << "Final Chain: " << *Chain.head() << "\n");

for (const IVInc &Inc : Chain) {		for (const IVInc &Inc : Chain) {
LLVM_DEBUG(dbgs() << " Inc: " << *Inc.UserInst << "\n");		LLVM_DEBUG(dbgs() << " Inc: " << *Inc.UserInst << "\n");
auto UseI = find(Inc.UserInst->operands(), Inc.IVOperand);		auto UseI = find(Inc.UserInst->operands(), Inc.IVOperand);
assert(UseI != Inc.UserInst->op_end() && "cannot find IV operand");		assert(UseI != Inc.UserInst->op_end() && "cannot find IV operand");
		if (FavorBackedgeIndex && UseI->getUser() == Chain.lastNonPHI())
		gilr: I'm a bit confused here. IIUC a profitable complete chain provides an efficient solution using post-increments. Why break the last increment out?
		samparker (author): This was so that the last access could also be a target for optimisation. I'll remove the change and do some testing again to see the difference.
		continue;
IVIncSet.insert(UseI);		IVIncSet.insert(UseI);
}		}
}		}

/// Return true if the IVInc can be folded into an addressing mode.		/// Return true if the IVInc can be folded into an addressing mode.
static bool canFoldIVIncExpr(const SCEV *IncExpr, Instruction *UserInst,		static bool canFoldIVIncExpr(const SCEV *IncExpr, Instruction *UserInst,
Value *Operand, const TargetTransformInfo &TTI) {		Value *Operand, const TargetTransformInfo &TTI) {
const SCEVConstant *IncConst = dyn_cast<SCEVConstant>(IncExpr);		const SCEVConstant *IncConst = dyn_cast<SCEVConstant>(IncExpr);
▲ Show 20 Lines • Show All 639 Lines • ▼ Show 20 Lines	if (Base.Scale == 1)
GenerateSymbolicOffsetsImpl(LU, LUIdx, Base, /* Idx */ -1,		GenerateSymbolicOffsetsImpl(LU, LUIdx, Base, /* Idx */ -1,
/* IsScaledReg */ true);		/* IsScaledReg */ true);
}		}

/// Helper function for LSRInstance::GenerateConstantOffsets.		/// Helper function for LSRInstance::GenerateConstantOffsets.
void LSRInstance::GenerateConstantOffsetsImpl(		void LSRInstance::GenerateConstantOffsetsImpl(
LSRUse &LU, unsigned LUIdx, const Formula &Base,		LSRUse &LU, unsigned LUIdx, const Formula &Base,
const SmallVectorImpl<int64_t> &Worklist, size_t Idx, bool IsScaledReg) {		const SmallVectorImpl<int64_t> &Worklist, size_t Idx, bool IsScaledReg) {
const SCEV *G = IsScaledReg ? Base.ScaledReg : Base.BaseRegs[Idx];
for (int64_t Offset : Worklist) {		auto GenerateOffset = [&](const SCEV *G, int64_t Offset) {
Formula F = Base;		Formula F = Base;
F.BaseOffset = (uint64_t)Base.BaseOffset - Offset;		F.BaseOffset = (uint64_t)Base.BaseOffset - Offset;

if (isLegalUse(TTI, LU.MinOffset - Offset, LU.MaxOffset - Offset, LU.Kind,		if (isLegalUse(TTI, LU.MinOffset - Offset, LU.MaxOffset - Offset, LU.Kind,
LU.AccessTy, F)) {		LU.AccessTy, F)) {
// Add the offset to the base register.		// Add the offset to the base register.
const SCEV *NewG = SE.getAddExpr(SE.getConstant(G->getType(), Offset), G);		const SCEV *NewG = SE.getAddExpr(SE.getConstant(G->getType(), Offset), G);
// If it cancelled out, drop the base register, otherwise update it.		// If it cancelled out, drop the base register, otherwise update it.
if (NewG->isZero()) {		if (NewG->isZero()) {
if (IsScaledReg) {		if (IsScaledReg) {
F.Scale = 0;		F.Scale = 0;
F.ScaledReg = nullptr;		F.ScaledReg = nullptr;
} else		} else
F.deleteBaseReg(F.BaseRegs[Idx]);		F.deleteBaseReg(F.BaseRegs[Idx]);
		gilr: Is this relevant for non-Address kind formulae?
		samparker (author): No. I had naively assumed that this function was only generating formulae for addresses... so if that's not the case, I'll make the change here.
		gilr: Worth a comment here to describe the motivation and how adjusting the offset generates the post-inc opportunity.
F.canonicalize(*L);		F.canonicalize(*L);
} else if (IsScaledReg)		} else if (IsScaledReg)
		gilr: How does a non-constant step (e.g. 50 + %x) translate to post-inc? Could you add such a test case?
		samparker (author): Ah, good catch. This should only trigger for constant steps.
F.ScaledReg = NewG;		F.ScaledReg = NewG;
else		else
F.BaseRegs[Idx] = NewG;		F.BaseRegs[Idx] = NewG;

(void)InsertFormula(LU, LUIdx, F);		(void)InsertFormula(LU, LUIdx, F);
}		}
		};
		gilr: The new TTI API relates to the same HW feature, right? Why not use a cl::opt in LSR to turn this optimization off?
		samparker (author): When I've got the next patch ready, I will re-run my tests and see if this is viable. It may not be good for all targets, and if so, I think a target hook would be more useful.

		const SCEV *G = IsScaledReg ? Base.ScaledReg : Base.BaseRegs[Idx];

		// With constant offsets and constant steps, we can generate post index
		// accesses by having the offset equal the step. So, for access #0 with a
		// step of 8, we could generate a G - 8 base which would require the first
		// access to be ((G - 8) + 8),+,8. The post-indexed access would then update
		// the pointer for itself in the next iteration.
		if (FavorBackedgeIndex && LU.Kind == LSRUse::Address) {
		if (auto *GAR = dyn_cast<SCEVAddRecExpr>(G)) {
		if (auto *StepRec =
		dyn_cast<SCEVConstant>(GAR->getStepRecurrence(SE))) {
		const APInt &StepInt = StepRec->getAPInt();
		int64_t Step = StepInt.isNegative() ?
		StepInt.getSExtValue() : StepInt.getZExtValue();

		for (int64_t Offset : Worklist) {
		Offset -= Step;
		GenerateOffset(G, Offset);
		}
}		}
		}
		}
		for (int64_t Offset : Worklist)
		GenerateOffset(G, Offset);

int64_t Imm = ExtractImmediate(G, SE);		int64_t Imm = ExtractImmediate(G, SE);
if (G->isZero() || Imm == 0)		if (G->isZero() || Imm == 0)
return;		return;
Formula F = Base;		Formula F = Base;
F.BaseOffset = (uint64_t)F.BaseOffset + Imm;		F.BaseOffset = (uint64_t)F.BaseOffset + Imm;
if (!isLegalUse(TTI, LU.MinOffset, LU.MaxOffset, LU.Kind, LU.AccessTy, F))		if (!isLegalUse(TTI, LU.MinOffset, LU.MaxOffset, LU.Kind, LU.AccessTy, F))
return;		return;
▲ Show 20 Lines • Show All 622 Lines • ▼ Show 20 Lines	if (EstimateSearchSpaceComplexity() >= ComplexityLimit) {

LLVM_DEBUG(dbgs() << "After pre-selection:\n"; print_uses(dbgs()));		LLVM_DEBUG(dbgs() << "After pre-selection:\n"; print_uses(dbgs()));
}		}
}		}

/// When there are many registers for expressions like A, A+1, A+2, etc.,		/// When there are many registers for expressions like A, A+1, A+2, etc.,
/// allocate a single register for them.		/// allocate a single register for them.
void LSRInstance::NarrowSearchSpaceByCollapsingUnrolledCode() {		void LSRInstance::NarrowSearchSpaceByCollapsingUnrolledCode() {
if (EstimateSearchSpaceComplexity() < ComplexityLimit)		if (!CollapseUnrolledCode &&
		EstimateSearchSpaceComplexity() < ComplexityLimit)
		gilr: The convention for flags controlling the narrowing heuristics seems to be to use them in NarrowSearchSpaceUsingHeuristics() rather than in the functions they affect.
		gilr: I think you meant '||' here.
return;		return;

LLVM_DEBUG(		LLVM_DEBUG(
dbgs() << "The search space is too complex.\n"		dbgs() << "The search space is too complex.\n"
"Narrowing the search space by assuming that uses separated "		"Narrowing the search space by assuming that uses separated "
"by a constant offset will use the same registers.\n");		"by a constant offset will use the same registers.\n");

// This is especially useful for unrolled loops.		// This is especially useful for unrolled loops.
▲ Show 20 Lines • Show All 944 Lines • ▼ Show 20 Lines	#endif
Rewriter.clear();		Rewriter.clear();

Changed |= DeleteTriviallyDeadInstructions(DeadInsts);		Changed |= DeleteTriviallyDeadInstructions(DeadInsts);
}		}

LSRInstance::LSRInstance(Loop *L, IVUsers &IU, ScalarEvolution &SE,		LSRInstance::LSRInstance(Loop *L, IVUsers &IU, ScalarEvolution &SE,
DominatorTree &DT, LoopInfo &LI,		DominatorTree &DT, LoopInfo &LI,
const TargetTransformInfo &TTI)		const TargetTransformInfo &TTI)
: IU(IU), SE(SE), DT(DT), LI(LI), TTI(TTI), L(L) {		: IU(IU), SE(SE), DT(DT), LI(LI), TTI(TTI), L(L),
		FavorBackedgeIndex(EnableBackedgeIndexing &&
		TTI.shouldFavorBackedgeIndex(L)) {
// If LoopSimplify form is not available, stay out of trouble.		// If LoopSimplify form is not available, stay out of trouble.
if (!L->isLoopSimplifyForm())		if (!L->isLoopSimplifyForm())
return;		return;

// If there's no interesting work to be done, bail early.		// If there's no interesting work to be done, bail early.
if (IU.empty()) return;		if (IU.empty()) return;

// If there's too much analysis to be done, bail early. We won't be able to		// If there's too much analysis to be done, bail early. We won't be able to
▲ Show 20 Lines • Show All 258 Lines • Show Last 20 Lines
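
The comment added in GenerateConstantOffsetsImpl describes biasing each worklist offset by the constant step so that the access immediate equals the step. A minimal standalone sketch of that arithmetic follows. MiniFormula, generateOffset and backedgeCandidates are hypothetical stand-ins rather than LLVM types or functions, plain integers replace SCEV expressions, and the zero-step assertion is an assumption of the sketch, not something the patch checks.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the two Formula fields the lambda updates.
struct MiniFormula {
  int64_t BaseOffset; // immediate folded into the addressing mode
  int64_t RegStart;   // start value of the base register's recurrence
};

// Mirrors the GenerateOffset lambda: move `Offset` out of the immediate
// and into the register, leaving the accessed address unchanged.
MiniFormula generateOffset(const MiniFormula &Base, int64_t Offset) {
  return {Base.BaseOffset - Offset, Base.RegStart + Offset};
}

// Mirrors the FavorBackedgeIndex loop: bias every worklist offset by the
// step before generating the candidate formula.
std::vector<MiniFormula> backedgeCandidates(const MiniFormula &Base,
                                            const std::vector<int64_t> &Worklist,
                                            int64_t Step) {
  assert(Step != 0 && "sketch assumes a non-zero constant step");
  std::vector<MiniFormula> Out;
  for (int64_t Offset : Worklist)
    Out.push_back(generateOffset(Base, Offset - Step));
  return Out;
}
```

With a base formula of {0, G}, a worklist offset of 0 and a step of 8, the candidate is {8, G - 8}: the register starts at G - 8 and the access immediate is 8, which matches the ((G - 8) + 8),+,8 form in the patch comment. The sum of the two fields is the same for every candidate, so the address accessed does not change.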

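The inline comment on NarrowSearchSpaceByCollapsingUnrolledCode questions whether the early-return guard should use '&&' or '||'. The sketch below only tabulates what each operator does and does not say which one the author intended. bailsWithAnd and bailsWithOr are hypothetical helpers, not LLVM functions, and plain integers stand in for EstimateSearchSpaceComplexity() and ComplexityLimit.

```cpp
#include <cstdint>

// Each helper returns true when the function would return early,
// i.e. skip the collapsing heuristic.

// Guard as written in the patch: skip only when the flag is off AND the
// search space is below the limit. Setting the flag forces the heuristic.
bool bailsWithAnd(bool CollapseUnrolledCode, uint64_t Complexity,
                  uint64_t Limit) {
  return !CollapseUnrolledCode && Complexity < Limit;
}

// Guard as suggested in review: skip when the flag is off OR the search
// space is below the limit. The flag then gates the heuristic entirely.
bool bailsWithOr(bool CollapseUnrolledCode, uint64_t Complexity,
                 uint64_t Limit) {
  return !CollapseUnrolledCode || Complexity < Limit;
}
```

The two guards agree only when the flag is off and the complexity is below the limit (both skip), or when the flag is on and the complexity is at or above the limit (both run). They differ when the flag is on with low complexity, and when the flag is off with high complexity.
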
test/CodeGen/ARM/dsp-loop-indexing.ll

This file was added.

				; RUN: llc -mtriple=thumbv7em -mattr=+fp-armv8 %s -o - | FileCheck %s --check-prefix=CHECK --check-prefix=CHECK-DEFAULT
				; RUN: llc -mtriple=thumbv8m.main -mattr=+fp-armv8,+dsp %s -o - | FileCheck %s --check-prefix=CHECK --check-prefix=CHECK-DEFAULT
				; RUN: llc -mtriple=thumbv8m.main -mattr=+fp-armv8,+dsp -lsr-backedge-postincs=false %s -o - | FileCheck %s --check-prefix=CHECK --check-prefix=DISABLED
				; RUN: llc -mtriple=thumbv8 %s -o - | FileCheck %s --check-prefix=CHECK --check-prefix=DISABLED
				; RUN: llc -mtriple=thumbv8m.main -mattr=+fp-armv8,+dsp -lsr-complexity-limit=2147483647 %s -o - | FileCheck %s --check-prefix=CHECK --check-prefix=CHECK-COMPLEX
				; RUN: llc -mtriple=thumbv8m.main -mattr=+fp-armv8,+dsp -lsr-complexity-limit=2147483647 -lsr-collapse-unrolled=true %s -o - | FileCheck %s --check-prefix=CHECK --check-prefix=CHECK-DEFAULT

				; CHECK-LABEL: test_qadd_2
				; CHECK: @ %loop
				; TODO: pre-inc str

				; CHECK-DEFAULT: ldr{{.*}}, #4]
				; CHECK-DEFAULT: ldr{{.*}}, #4]
				; CHECK-DEFAULT: str{{.*}}, #4]
				; CHECK-DEFAULT: ldr{{.*}}, #8]!
				; CHECK-DEFAULT: ldr{{.*}}, #8]!
				; CHECK-DEFAULT: str{{.*}}, #8]
				; CHECK-DEFAULT: add{{.*}}, #8

				; CHECK-COMPLEX: ldr{{.*}}, #8]!
				; CHECK-COMPLEX: ldr{{.*}}, #8]!
				; CHECK-COMPLEX: str{{.*}}, #8]!
				; CHECK-COMPLEX: ldr{{.*}}, #4]
				; CHECK-COMPLEX: ldr{{.*}}, #4]
				; CHECK-COMPLEX: str{{.*}}, #4]

				; DISABLED-NOT: ldr{{.*}}]!
				; DISABLED-NOT: str{{.*}}]!

				define void @test_qadd_2(i32* %a.array, i32* %b.array, i32* %out.array, i32 %N) {
				entry:
				br label %loop

				loop:
				%i = phi i32 [ 0, %entry ], [ %i.next, %loop ]
				%idx.1 = phi i32 [ 0, %entry ], [ %idx.next, %loop ]
				%gep.a.1 = getelementptr inbounds i32, i32* %a.array, i32 %idx.1
				%a.1 = load i32, i32* %gep.a.1
				%gep.b.1 = getelementptr inbounds i32, i32* %b.array, i32 %idx.1
				%b.1 = load i32, i32* %gep.b.1
				%qadd.1 = call i32 @llvm.arm.qadd(i32 %a.1, i32 %b.1)
				%addr.1 = getelementptr inbounds i32, i32* %out.array, i32 %idx.1
				store i32 %qadd.1, i32* %addr.1
				%idx.2 = or i32 %idx.1, 1
				%gep.a.2 = getelementptr inbounds i32, i32* %a.array, i32 %idx.2
				%a.2 = load i32, i32* %gep.a.2
				%gep.b.2 = getelementptr inbounds i32, i32* %b.array, i32 %idx.2
				%b.2 = load i32, i32* %gep.b.2
				%qadd.2 = call i32 @llvm.arm.qadd(i32 %a.2, i32 %b.2)
				%addr.2 = getelementptr inbounds i32, i32* %out.array, i32 %idx.2
				store i32 %qadd.2, i32* %addr.2
				%i.next = add nsw nuw i32 %i, -2
				%idx.next = add nsw nuw i32 %idx.1, 2
				%cmp = icmp ult i32 %i.next, %N
				br i1 %cmp, label %loop, label %exit

				exit:
				ret void
				}

				; CHECK-LABEL: test_qadd_2_backwards
				; TODO: Post increments should be generated.

				; CHECK: @ %loop

				; CHECK-DEFAULT: ldr{{.*}},
				; CHECK-DEFAULT: ldr{{.*}},
				; CHECK-DEFAULT: str{{.*}},
				; CHECK-DEFAULT: ldr{{.*}}, #-4]
				; CHECK-DEFAULT: ldr{{.*}}, #-4]
				; CHECK-DEFAULT: sub{{.*}}, #8
				; CHECK-DEFAULT: str{{.*}}, #-4]
				; CHECK-DEFAULT: sub{{.*}}, #8

				; CHECK-COMPLEX: ldr{{.*}} lsl #2]
				; CHECK-COMPLEX: ldr{{.*}} lsl #2]
				; CHECK-COMPLEX: str{{.*}} lsl #2]
				; CHECK-COMPLEX: ldr{{.*}} lsl #2]
				; CHECK-COMPLEX: ldr{{.*}} lsl #2]
				; CHECK-COMPLEX: str{{.*}} lsl #2]

				; DISABLED-NOT: ldr{{.*}}]!
				; DISABLED-NOT: str{{.*}}]!

				define void @test_qadd_2_backwards(i32* %a.array, i32* %b.array, i32* %out.array, i32 %N) {
				entry:
				br label %loop

				loop:
				%i = phi i32 [ 0, %entry ], [ %i.next, %loop ]
				%idx.1 = phi i32 [ %N, %entry ], [ %idx.next, %loop ]
				%gep.a.1 = getelementptr inbounds i32, i32* %a.array, i32 %idx.1
				%a.1 = load i32, i32* %gep.a.1
				%gep.b.1 = getelementptr inbounds i32, i32* %b.array, i32 %idx.1
				%b.1 = load i32, i32* %gep.b.1
				%qadd.1 = call i32 @llvm.arm.qadd(i32 %a.1, i32 %b.1)
				%addr.1 = getelementptr inbounds i32, i32* %out.array, i32 %idx.1
				store i32 %qadd.1, i32* %addr.1
				%idx.2 = sub nsw nuw i32 %idx.1, 1
				%gep.a.2 = getelementptr inbounds i32, i32* %a.array, i32 %idx.2
				%a.2 = load i32, i32* %gep.a.2
				%gep.b.2 = getelementptr inbounds i32, i32* %b.array, i32 %idx.2
				%b.2 = load i32, i32* %gep.b.2
				%qadd.2 = call i32 @llvm.arm.qadd(i32 %a.2, i32 %b.2)
				%addr.2 = getelementptr inbounds i32, i32* %out.array, i32 %idx.2
				store i32 %qadd.2, i32* %addr.2
				%i.next = add nsw nuw i32 %i, -2
				%idx.next = sub nsw nuw i32 %idx.1, 2
				%cmp = icmp ult i32 %i.next, %N
				br i1 %cmp, label %loop, label %exit

				exit:
				ret void
				}

				; CHECK-LABEL: test_qadd_3
				; CHECK: @ %loop

				; TODO: pre-inc str

				; CHECK-DEFAULT: ldr{{.*}}, #4]
				; CHECK-DEFAULT: ldr{{.*}}, #4]
				; CHECK-DEFAULT: str{{.*}}, #4]
				; CHECK-DEFAULT: ldr{{.*}}, #8]
				; CHECK-DEFAULT: ldr{{.*}}, #8]
				; CHECK-DEFAULT: str{{.*}}, #8]
				; CHECK-DEFAULT: ldr{{.*}}, #12]!
				; CHECK-DEFAULT: ldr{{.*}}, #12]!
				; CHECK-DEFAULT: str{{.*}}, #12]
				; CHECK-DEFAULT: add{{.*}}, #12

				; CHECK-COMPLEX: ldr{{.*}}, #12]!
				; CHECK-COMPLEX: ldr{{.*}}, #12]!
				; CHECK-COMPLEX: str{{.*}}, #12]!
				; CHECK-COMPLEX: ldr{{.*}}, #4]
				; CHECK-COMPLEX: ldr{{.*}}, #4]
				; CHECK-COMPLEX: str{{.*}}, #4]
				; CHECK-COMPLEX: ldr{{.*}}, #8]
				; CHECK-COMPLEX: ldr{{.*}}, #8]
				; CHECK-COMPLEX: str{{.*}}, #8]

				; DISABLED-NOT: ldr{{.*}}]!
				; DISABLED-NOT: str{{.*}}]!

				define void @test_qadd_3(i32* %a.array, i32* %b.array, i32* %out.array, i32 %N) {
				entry:
				br label %loop

				loop:
				%i = phi i32 [ 0, %entry ], [ %i.next, %loop ]
				%idx.1 = phi i32 [ 0, %entry ], [ %idx.next, %loop ]
				%gep.a.1 = getelementptr inbounds i32, i32* %a.array, i32 %idx.1
				%a.1 = load i32, i32* %gep.a.1
				%gep.b.1 = getelementptr inbounds i32, i32* %b.array, i32 %idx.1
				%b.1 = load i32, i32* %gep.b.1
				%qadd.1 = call i32 @llvm.arm.qadd(i32 %a.1, i32 %b.1)
				%addr.1 = getelementptr inbounds i32, i32* %out.array, i32 %idx.1
				store i32 %qadd.1, i32* %addr.1
				%idx.2 = add nuw nsw i32 %idx.1, 1
				%gep.a.2 = getelementptr inbounds i32, i32* %a.array, i32 %idx.2
				%a.2 = load i32, i32* %gep.a.2
				%gep.b.2 = getelementptr inbounds i32, i32* %b.array, i32 %idx.2
				%b.2 = load i32, i32* %gep.b.2
				%qadd.2 = call i32 @llvm.arm.qadd(i32 %a.2, i32 %b.2)
				%addr.2 = getelementptr inbounds i32, i32* %out.array, i32 %idx.2
				store i32 %qadd.2, i32* %addr.2
				%idx.3 = add nuw nsw i32 %idx.1, 2
				%gep.a.3 = getelementptr inbounds i32, i32* %a.array, i32 %idx.3
				%a.3 = load i32, i32* %gep.a.3
				%gep.b.3 = getelementptr inbounds i32, i32* %b.array, i32 %idx.3
				%b.3 = load i32, i32* %gep.b.3
				%qadd.3 = call i32 @llvm.arm.qadd(i32 %a.3, i32 %b.3)
				%addr.3 = getelementptr inbounds i32, i32* %out.array, i32 %idx.3
				store i32 %qadd.3, i32* %addr.3
				%i.next = add nsw nuw i32 %i, -3
				%idx.next = add nsw nuw i32 %idx.1, 3
				%cmp = icmp ult i32 %i.next, %N
				br i1 %cmp, label %loop, label %exit

				exit:
				ret void
				}

				; CHECK-LABEL: test_qadd_4
				; CHECK: @ %loop

				; TODO: pre-inc store

				; CHECK-DEFAULT: ldr{{.*}}, #4]
				; CHECK-DEFAULT: ldr{{.*}}, #4]
				; CHECK-DEFAULT: str{{.*}}, #4]
				; CHECK-DEFAULT: ldr{{.*}}, #8]
				; CHECK-DEFAULT: ldr{{.*}}, #8]
				; CHECK-DEFAULT: str{{.*}}, #8]
				; CHECK-DEFAULT: ldr{{.*}}, #12]
				; CHECK-DEFAULT: ldr{{.*}}, #12]
				; CHECK-DEFAULT: str{{.*}}, #12]
				; CHECK-DEFAULT: ldr{{.*}}, #16]!
				; CHECK-DEFAULT: ldr{{.*}}, #16]!
				; CHECK-DEFAULT: str{{.*}}, #16]
				; CHECK-DEFAULT: add{{.*}}, #16

				; CHECK-COMPLEX: ldr{{.*}}, #16]!
				; CHECK-COMPLEX: ldr{{.*}}, #16]!
				; CHECK-COMPLEX: str{{.*}}, #16]!
				; CHECK-COMPLEX: ldr{{.*}}, #4]
				; CHECK-COMPLEX: ldr{{.*}}, #4]
				; CHECK-COMPLEX: str{{.*}}, #4]
				; CHECK-COMPLEX: ldr{{.*}}, #8]
				; CHECK-COMPLEX: ldr{{.*}}, #8]
				; CHECK-COMPLEX: str{{.*}}, #8]
				; CHECK-COMPLEX: ldr{{.*}}, #12]
				; CHECK-COMPLEX: ldr{{.*}}, #12]
				; CHECK-COMPLEX: str{{.*}}, #12]

				; DISABLED-NOT: ldr{{.*}}]!
				; DISABLED-NOT: str{{.*}}]!

				define void @test_qadd_4(i32* %a.array, i32* %b.array, i32* %out.array, i32 %N) {
				entry:
				br label %loop

				loop:
				%i = phi i32 [ 0, %entry ], [ %i.next, %loop ]
				%idx.1 = phi i32 [ 0, %entry ], [ %idx.next, %loop ]
				%gep.a.1 = getelementptr inbounds i32, i32* %a.array, i32 %idx.1
				%a.1 = load i32, i32* %gep.a.1
				%gep.b.1 = getelementptr inbounds i32, i32* %b.array, i32 %idx.1
				%b.1 = load i32, i32* %gep.b.1
				%qadd.1 = call i32 @llvm.arm.qadd(i32 %a.1, i32 %b.1)
				%addr.1 = getelementptr inbounds i32, i32* %out.array, i32 %idx.1
				store i32 %qadd.1, i32* %addr.1
				%idx.2 = or i32 %idx.1, 1
				%gep.a.2 = getelementptr inbounds i32, i32* %a.array, i32 %idx.2
				%a.2 = load i32, i32* %gep.a.2
				%gep.b.2 = getelementptr inbounds i32, i32* %b.array, i32 %idx.2
				%b.2 = load i32, i32* %gep.b.2
				%qadd.2 = call i32 @llvm.arm.qadd(i32 %a.2, i32 %b.2)
				%addr.2 = getelementptr inbounds i32, i32* %out.array, i32 %idx.2
				store i32 %qadd.2, i32* %addr.2
				%idx.3 = or i32 %idx.1, 2
				%gep.a.3 = getelementptr inbounds i32, i32* %a.array, i32 %idx.3
				%a.3 = load i32, i32* %gep.a.3
				%gep.b.3 = getelementptr inbounds i32, i32* %b.array, i32 %idx.3
				%b.3 = load i32, i32* %gep.b.3
				%qadd.3 = call i32 @llvm.arm.qadd(i32 %a.3, i32 %b.3)
				%addr.3 = getelementptr inbounds i32, i32* %out.array, i32 %idx.3
				store i32 %qadd.3, i32* %addr.3
				%idx.4 = or i32 %idx.1, 3
				%gep.a.4 = getelementptr inbounds i32, i32* %a.array, i32 %idx.4
				%a.4 = load i32, i32* %gep.a.4
				%gep.b.4 = getelementptr inbounds i32, i32* %b.array, i32 %idx.4
				%b.4 = load i32, i32* %gep.b.4
				%qadd.4 = call i32 @llvm.arm.qadd(i32 %a.4, i32 %b.4)
				%addr.4 = getelementptr inbounds i32, i32* %out.array, i32 %idx.4
				store i32 %qadd.4, i32* %addr.4
				%i.next = add nsw nuw i32 %i, -4
				%idx.next = add nsw nuw i32 %idx.1, 4
				%cmp = icmp ult i32 %i.next, %N
				br i1 %cmp, label %loop, label %exit

				exit:
				ret void
				}

				; CHECK-LABEL: test_qadd16_2
				; CHECK: @ %loop
				; TODO: pre-inc store.

				; CHECK-DEFAULT: ldr{{.*}}, #4]
				; CHECK-DEFAULT: ldr{{.*}}, #4]
				; CHECK-DEFAULT: str{{.*}}, #8]
				; CHECK-DEFAULT: ldr{{.*}}, #8]!
				; CHECK-DEFAULT: ldr{{.*}}, #8]!
				; CHECK-DEFAULT: str{{.*}}, #16]
				; CHECK-DEFAULT: add{{.*}}, #16

				; CHECK-COMPLEX: ldr{{.*}}, #8]!
				; CHECK-COMPLEX: ldr{{.*}}, #8]!
				; CHECK-COMPLEX: str{{.*}}, #16]!
				; CHECK-COMPLEX: ldr{{.*}}, #4]
				; CHECK-COMPLEX: ldr{{.*}}, #4]
				; CHECK-COMPLEX: str{{.*}}, #8]

				; DISABLED-NOT: ldr{{.*}}]!
				; DISABLED-NOT: str{{.*}}]!

				define void @test_qadd16_2(i16* %a.array, i16* %b.array, i32* %out.array, i32 %N) {
				entry:
				br label %loop

				loop:
				%i = phi i32 [ 0, %entry ], [ %i.next, %loop ]
				%idx.1 = phi i32 [ 0, %entry ], [ %idx.next, %loop ]
				%gep.a.1 = getelementptr inbounds i16, i16* %a.array, i32 %idx.1
				%cast.a.1 = bitcast i16* %gep.a.1 to i32*
				%a.1 = load i32, i32* %cast.a.1
				%gep.b.1 = getelementptr inbounds i16, i16* %b.array, i32 %idx.1
				%cast.b.1 = bitcast i16* %gep.b.1 to i32*
				%b.1 = load i32, i32* %cast.b.1
				%qadd.1 = call i32 @llvm.arm.qadd16(i32 %a.1, i32 %b.1)
				%addr.1 = getelementptr inbounds i32, i32* %out.array, i32 %idx.1
				store i32 %qadd.1, i32* %addr.1
				%idx.2 = add nsw nuw i32 %idx.1, 2
				%gep.a.2 = getelementptr inbounds i16, i16* %a.array, i32 %idx.2
				%cast.a.2 = bitcast i16* %gep.a.2 to i32*
				%a.2 = load i32, i32* %cast.a.2
				%gep.b.2 = getelementptr inbounds i16, i16* %b.array, i32 %idx.2
				%cast.b.2 = bitcast i16* %gep.b.2 to i32*
				%b.2 = load i32, i32* %cast.b.2
				%qadd.2 = call i32 @llvm.arm.qadd16(i32 %a.2, i32 %b.2)
				%addr.2 = getelementptr inbounds i32, i32* %out.array, i32 %idx.2
				store i32 %qadd.2, i32* %addr.2
				%i.next = add nsw nuw i32 %i, -2
				%idx.next = add nsw nuw i32 %idx.1, 4
				%cmp = icmp ult i32 %i.next, %N
				br i1 %cmp, label %loop, label %exit

				exit:
				ret void
				}

				declare i32 @llvm.arm.qadd(i32, i32)
				declare i32 @llvm.arm.qadd16(i32, i32)

test/CodeGen/ARM/loop-align-cortex-m.ll

	; RUN: llc -mtriple=thumbv7m-none-eabi %s -mcpu=cortex-m3 -o - | FileCheck %s			; RUN: llc -mtriple=thumbv7m-none-eabi %s -mcpu=cortex-m3 -o - | FileCheck %s
	; RUN: llc -mtriple=thumbv7m-none-eabi %s -mcpu=cortex-m4 -o - | FileCheck %s			; RUN: llc -mtriple=thumbv7m-none-eabi %s -mcpu=cortex-m4 -o - | FileCheck %s
	; RUN: llc -mtriple=thumbv7m-none-eabi %s -mcpu=cortex-m33 -o - | FileCheck %s			; RUN: llc -mtriple=thumbv8m-none-eabi %s -mcpu=cortex-m33 -o - | FileCheck %s
				gilr: You're just fixing this by the way, right?
				samparker: yes

	define void @test_loop_alignment(i32* %in, i32* %out) optsize {			define void @test_loop_alignment(i32* %in, i32* %out) optsize {
	; CHECK-LABEL: test_loop_alignment:			; CHECK-LABEL: test_loop_alignment:
	; CHECK: movs {{r[0-9]+}}, #0			; CHECK: mov{{.*}}, #0
	; CHECK: .p2align 2			; CHECK: .p2align 2

	entry:			entry:
	br label %loop			br label %loop

	loop:			loop:
	%i = phi i32 [ 0, %entry ], [ %i.next, %loop ]			%i = phi i32 [ 0, %entry ], [ %i.next, %loop ]
	%in.addr = getelementptr inbounds i32, i32* %in, i32 %i			%in.addr = getelementptr inbounds i32, i32* %in, i32 %i

test/CodeGen/ARM/loop-indexing.ll

This file was added.

				; RUN: llc -mtriple=thumbv7em -mattr=+fp-armv8 %s -o - | FileCheck %s --check-prefix=CHECK --check-prefix=CHECK-BASE --check-prefix=CHECK-DEFAULT --check-prefix=CHECK-T2
				; RUN: llc -mtriple=thumbv8m.main -mattr=+fp-armv8,+dsp %s -o - | FileCheck %s --check-prefix=CHECK --check-prefix=CHECK-DEFAULT --check-prefix=CHECK-T2
				; RUN: llc -mtriple=thumbv8m.main -mattr=+fp-armv8,+dsp -lsr-backedge-postincs=false %s -o - | FileCheck %s --check-prefix=CHECK --check-prefix=DISABLED
				; RUN: llc -mtriple=thumbv8m.base %s -o - | FileCheck %s --check-prefix=CHECK --check-prefix=DISABLED
				; RUN: llc -mtriple=thumbv8 %s -o - | FileCheck %s --check-prefix=CHECK --check-prefix=DISABLED
				; RUN: llc -mtriple=thumbv8m.main -mattr=+fp-armv8,+dsp -lsr-complexity-limit=2147483647 %s -o - | FileCheck %s --check-prefix=CHECK --check-prefix=CHECK-COMPLEX --check-prefix=CHECK-T2
				; RUN: llc -mtriple=thumbv8m.main -mattr=+fp-armv8,+dsp -lsr-complexity-limit=2147483647 -lsr-collapse-unrolled=true %s -o - | FileCheck %s --check-prefix=CHECK --check-prefix=CHECK-DEFAULT --check-prefix=CHECK-COLLAPSE --check-prefix=CHECK-T2

				; Tests to check that post increment addressing modes are used instead of
				; updating base pointers with add instructions.

				; TODO: I think we should be able to use post inc addressing with VLDM
				; instructions.
				; CHECK-LABEL: test_fma
				; CHECK: @ %loop

				; CHECK-BASE: vldr s{{.*}}, #8]
				; CHECK-BASE: vldr s{{.*}}, #8]
				; CHECK-BASE: vldr s{{.*}}, #12]
				; CHECK-BASE: vldr s{{.*}}, #12]

				; CHECK-COMPLEX: vldr s{{.*}}, #8]
				; CHECK-COMPLEX: vldr s{{.*}}, #8]
				; CHECK-COMPLEX: vldr s{{.*}}, #12]
				; CHECK-COMPLEX: vldr s{{.*}}, #12]

				; CHECK-COLLAPSE: vldr s{{.*}}, #4]
				; CHECK-COLLAPSE: vldr s{{.*}}, #4]
				; CHECK-COLLAPSE: vldr s{{.*}}, #8]
				; CHECK-COLLAPSE: vldr s{{.*}}, #8]
				define float @test_fma(float* %a, float* %b, i32 %N) {
				entry:
				br label %loop

				loop:
				%i = phi i32 [ 0, %entry ], [ %i.next, %loop ]
				%idx.1 = phi i32 [ 0, %entry ], [ %idx.next, %loop ]
				%res = phi float [ 0.0, %entry ], [ %fma.2, %loop ]
				%gep.a.1 = getelementptr inbounds float, float* %a, i32 %idx.1
				%a.1 = load float, float* %gep.a.1
				%gep.b.1 = getelementptr inbounds float, float* %b, i32 %idx.1
				%b.1 = load float, float* %gep.b.1
				%fmul.1 = fmul float %a.1, %b.1
				%fma.1 = fadd float %fmul.1, %res
				%idx.2 = or i32 %idx.1, 1
				%gep.a.2 = getelementptr inbounds float, float* %a, i32 %idx.2
				%a.2 = load float, float* %gep.a.2
				%gep.b.2 = getelementptr inbounds float, float* %b, i32 %idx.2
				%b.2 = load float, float* %gep.b.2
				%fmul.2 = fmul float %a.2, %b.2
				%fma.2 = fadd float %fmul.2, %fma.1
				%i.next = add nsw nuw i32 %i, -2
				%idx.next = add nsw nuw i32 %idx.1, 2
				%cmp = icmp ult i32 %i.next, %N
				br i1 %cmp, label %loop, label %exit

				exit:
				ret float %fma.2
				}

				; CHECK-LABEL: convolve_16bit

				; TODO: Generate pre-incs without higher complexity limit
				; CHECK-DEFAULT: ldr{{.*}}, #6]
				; CHECK-DEFAULT: ldr{{.*}}, #6]
				; CHECK-DEFAULT: ldr{{.*}}, #2]
				; CHECK-DEFAULT: ldr{{.*}}, #2]

				; CHECK-COMPLEX: ldr{{.*}}, #8]!
				; CHECK-COMPLEX: ldr{{.*}}, #8]!
				; CHECK-COMPLEX: ldr{{.*}}, #4]
				; CHECK-COMPLEX: ldr{{.*}}, #4]

				; DISABLED-NOT: ldr{{.*}}]!
				; DISABLED-NOT: str{{.*}}]!

				define void @convolve_16bit(i16** nocapture readonly %input_image, i16** nocapture readonly %filter,
				i32 %filter_dim, i32 %out_width, i32 %out_height,
				i32** nocapture readonly %convolved) {
				entry:
				%cmp92 = icmp eq i32 %out_height, 0
				br i1 %cmp92, label %for.cond.cleanup, label %for.cond1.preheader.lr.ph

				for.cond1.preheader.lr.ph: ; preds = %entry
				%xtraiter = and i32 %filter_dim, 3
				%unroll_iter = sub i32 %filter_dim, %xtraiter
				br label %for.cond1.preheader

				for.cond1.preheader: ; preds = %for.cond.cleanup3, %for.cond1.preheader.lr.ph
				%res_y.093 = phi i32 [ 0, %for.cond1.preheader.lr.ph ], [ %add28, %for.cond.cleanup3 ]
				%arrayidx22 = getelementptr inbounds i32*, i32** %convolved, i32 %res_y.093
				%tmp3 = load i32*, i32** %arrayidx22, align 4
				br label %for.cond9.preheader.us.us.preheader

				for.cond9.preheader.us.us.preheader: ; preds = %for.cond5.for.cond.cleanup7_crit_edge.us, %for.cond5.preheader.lr.ph
				%res_x.060.us = phi i32 [ %add25.us, %for.cond5.for.cond.cleanup7_crit_edge.us ], [ 0, %for.cond1.preheader ]
				br label %for.cond9.preheader.us.us

				for.cond9.preheader.us.us: ; preds = %for.cond9.for.cond.cleanup11_crit_edge.us.us, %for.cond9.preheader.us.us.preheader
				%filter_y.056.us.us = phi i32 [ %inc20.us.us, %for.cond9.for.cond.cleanup11_crit_edge.us.us.unr-lcssa ], [ 0, %for.cond9.preheader.us.us.preheader ]
				%result_element.055.us.us = phi i32 [ %add18.us.us.3, %for.cond9.for.cond.cleanup11_crit_edge.us.us.unr-lcssa ], [ 0, %for.cond9.preheader.us.us.preheader ]
				%add.us.us = add i32 %filter_y.056.us.us, %res_y.093
				%arrayidx.us.us = getelementptr inbounds i16*, i16** %filter, i32 %filter_y.056.us.us
				%tmp5 = load i16*, i16** %arrayidx.us.us, align 4
				%arrayidx15.us.us = getelementptr inbounds i16*, i16** %input_image, i32 %add.us.us
				%tmp6 = load i16*, i16** %arrayidx15.us.us, align 4
				br label %for.body12.us.us

				for.body12.us.us: ; preds = %for.body12.us.us, %for.cond9.preheader.us.us
				%filter_x.053.us.us = phi i32 [ %inc.us.us.3, %for.body12.us.us ], [ 0, %for.cond9.preheader.us.us ]
				%result_element.152.us.us = phi i32 [ %add18.us.us.3, %for.body12.us.us ], [ %result_element.055.us.us, %for.cond9.preheader.us.us ]
				%niter = phi i32 [ %niter.nsub.3, %for.body12.us.us ], [ %unroll_iter, %for.cond9.preheader.us.us ]
				%add13.us.us = add i32 %filter_x.053.us.us, %res_x.060.us
				%arrayidx14.us.us = getelementptr inbounds i16, i16* %tmp5, i32 %filter_x.053.us.us
				%tmp9 = load i16, i16* %arrayidx14.us.us, align 2
				%conv.us.us = sext i16 %tmp9 to i32
				%arrayidx16.us.us = getelementptr inbounds i16, i16* %tmp6, i32 %add13.us.us
				%tmp10 = load i16, i16* %arrayidx16.us.us, align 2
				%conv17.us.us = sext i16 %tmp10 to i32
				%mul.us.us = mul nsw i32 %conv17.us.us, %conv.us.us
				%add18.us.us = add nsw i32 %mul.us.us, %result_element.152.us.us
				%inc.us.us = or i32 %filter_x.053.us.us, 1
				%add13.us.us.1 = add i32 %inc.us.us, %res_x.060.us
				%arrayidx14.us.us.1 = getelementptr inbounds i16, i16* %tmp5, i32 %inc.us.us
				%tmp11 = load i16, i16* %arrayidx14.us.us.1, align 2
				%conv.us.us.1 = sext i16 %tmp11 to i32
				%arrayidx16.us.us.1 = getelementptr inbounds i16, i16* %tmp6, i32 %add13.us.us.1
				%tmp12 = load i16, i16* %arrayidx16.us.us.1, align 2
				%conv17.us.us.1 = sext i16 %tmp12 to i32
				%mul.us.us.1 = mul nsw i32 %conv17.us.us.1, %conv.us.us.1
				%add18.us.us.1 = add nsw i32 %mul.us.us.1, %add18.us.us
				%inc.us.us.1 = or i32 %filter_x.053.us.us, 2
				%add13.us.us.2 = add i32 %inc.us.us.1, %res_x.060.us
				%arrayidx14.us.us.2 = getelementptr inbounds i16, i16* %tmp5, i32 %inc.us.us.1
				%tmp13 = load i16, i16* %arrayidx14.us.us.2, align 2
				%conv.us.us.2 = sext i16 %tmp13 to i32
				%arrayidx16.us.us.2 = getelementptr inbounds i16, i16* %tmp6, i32 %add13.us.us.2
				%tmp14 = load i16, i16* %arrayidx16.us.us.2, align 2
				%conv17.us.us.2 = sext i16 %tmp14 to i32
				%mul.us.us.2 = mul nsw i32 %conv17.us.us.2, %conv.us.us.2
				%add18.us.us.2 = add nsw i32 %mul.us.us.2, %add18.us.us.1
				%inc.us.us.2 = or i32 %filter_x.053.us.us, 3
				%add13.us.us.3 = add i32 %inc.us.us.2, %res_x.060.us
				%arrayidx14.us.us.3 = getelementptr inbounds i16, i16* %tmp5, i32 %inc.us.us.2
				%tmp15 = load i16, i16* %arrayidx14.us.us.3, align 2
				%conv.us.us.3 = sext i16 %tmp15 to i32
				%arrayidx16.us.us.3 = getelementptr inbounds i16, i16* %tmp6, i32 %add13.us.us.3
				%tmp16 = load i16, i16* %arrayidx16.us.us.3, align 2
				%conv17.us.us.3 = sext i16 %tmp16 to i32
				%mul.us.us.3 = mul nsw i32 %conv17.us.us.3, %conv.us.us.3
				%add18.us.us.3 = add nsw i32 %mul.us.us.3, %add18.us.us.2
				%inc.us.us.3 = add i32 %filter_x.053.us.us, 4
				%niter.nsub.3 = add i32 %niter, -4
				%niter.ncmp.3 = icmp eq i32 %niter.nsub.3, 0
				br i1 %niter.ncmp.3, label %for.cond9.for.cond.cleanup11_crit_edge.us.us.unr-lcssa, label %for.body12.us.us

				for.cond9.for.cond.cleanup11_crit_edge.us.us.unr-lcssa: ; preds = %for.body12.us.us, %for.cond9.preheader.us.us
				%inc20.us.us = add nuw i32 %filter_y.056.us.us, 1
				%exitcond98 = icmp eq i32 %inc20.us.us, %filter_dim
				br i1 %exitcond98, label %for.cond5.for.cond.cleanup7_crit_edge.us, label %for.cond9.preheader.us.us

				for.cond5.for.cond.cleanup7_crit_edge.us: ; preds = %for.cond9.for.cond.cleanup11_crit_edge.us.us
				%arrayidx23.us = getelementptr inbounds i32, i32* %tmp3, i32 %res_x.060.us
				store i32 %add18.us.us.3, i32* %arrayidx23.us, align 4
				%add25.us = add nuw i32 %res_x.060.us, 1
				%exitcond99 = icmp eq i32 %add25.us, %out_width
				br i1 %exitcond99, label %for.cond.cleanup3, label %for.cond9.preheader.us.us.preheader

				for.cond.cleanup3: ; preds = %for.cond5.for.cond.cleanup7_crit_edge.us, %for.cond5.preheader.preheader, %for.cond1.preheader
				%add28 = add nuw i32 %res_y.093, 1
				%exitcond100 = icmp eq i32 %add28, %out_height
				br i1 %exitcond100, label %for.cond.cleanup, label %for.cond1.preheader

				for.cond.cleanup: ; preds = %for.cond.cleanup3, %entry
				ret void
				}

				; CHECK-LABEL: mul_8x8
				; CHECK: @ %for.body
				; TODO: pre-inc store.

				; CHECK-DEFAULT: ldrb{{.*}}, #1]
				; CHECK-DEFAULT: ldrb{{.*}}, #1]
				; CHECK-DEFAULT: str{{.*}}, #4]
				; CHECK-DEFAULT: ldrb{{.*}}, #2]
				; CHECK-DEFAULT: ldrb{{.*}}, #2]
				; CHECK-DEFAULT: str{{.*}}, #8]
				; CHECK-DEFAULT: ldrb{{.*}}, #3]
				; CHECK-DEFAULT: ldrb{{.*}}, #3]
				; CHECK-DEFAULT: str{{.*}}, #12]
				; CHECK-DEFAULT: ldrb{{.*}}, #4]!
				; CHECK-DEFAULT: ldrb{{.*}}, #4]!
				; CHECK-DEFAULT: str{{.*}}, #16]
				; CHECK-DEFAULT: add{{.*}}, #16

				; CHECK-COMPLEX: ldrb
				; CHECK-COMPLEX: ldrb
				; CHECK-COMPLEX: str
				; CHECK-COMPLEX: ldrb{{.*}}, #1]
				; CHECK-COMPLEX: ldrb{{.*}}, #1]
				; CHECK-COMPLEX: str{{.*}}, #4]
				; CHECK-COMPLEX: ldrb{{.*}}, #2]
				; CHECK-COMPLEX: ldrb{{.*}}, #2]
				; CHECK-COMPLEX: str{{.*}}, #8]
				; CHECK-COMPLEX: ldrb{{.*}}, #3]
				; CHECK-COMPLEX: ldrb{{.*}}, #3]
				; CHECK-COMPLEX: str{{.*}}, #12]

				; DISABLED-NOT: ldr{{.*}}]!
				; DISABLED-NOT: str{{.*}}]!

				; CHECK-T2: @ %for.body.epil
				; CHECK-T2: ldrb{{.*}}, #1]!
				; CHECK-T2: ldrb{{.*}}, #1]!
				; CHECK-T2: str{{.*}}, #4]!

				define void @mul_8x8(i8* nocapture readonly %A, i8* nocapture readonly %B, i32* nocapture %C, i32 %N) {
				entry:
				%cmp9 = icmp eq i32 %N, 0
				br i1 %cmp9, label %for.cond.cleanup, label %for.body.preheader

				for.body.preheader: ; preds = %entry
				%tmp = add i32 %N, -1
				%xtraiter = and i32 %N, 3
				%tmp1 = icmp ult i32 %tmp, 3
				br i1 %tmp1, label %for.cond.cleanup.loopexit.unr-lcssa, label %for.body.preheader.new

				for.body.preheader.new: ; preds = %for.body.preheader
				%unroll_iter = sub i32 %N, %xtraiter
				br label %for.body

				for.cond.cleanup.loopexit.unr-lcssa: ; preds = %for.body, %for.body.preheader
				%i.010.unr = phi i32 [ 0, %for.body.preheader ], [ %inc.3, %for.body ]
				%lcmp.mod = icmp eq i32 %xtraiter, 0
				br i1 %lcmp.mod, label %for.cond.cleanup, label %for.body.epil

				for.body.epil: ; preds = %for.body.epil, %for.cond.cleanup.loopexit.unr-lcssa
				%i.010.epil = phi i32 [ %inc.epil, %for.body.epil ], [ %i.010.unr, %for.cond.cleanup.loopexit.unr-lcssa ]
				%epil.iter = phi i32 [ %epil.iter.sub, %for.body.epil ], [ %xtraiter, %for.cond.cleanup.loopexit.unr-lcssa ]
				%arrayidx.epil = getelementptr inbounds i8, i8* %A, i32 %i.010.epil
				%tmp2 = load i8, i8* %arrayidx.epil, align 1
				%conv.epil = zext i8 %tmp2 to i32
				%arrayidx1.epil = getelementptr inbounds i8, i8* %B, i32 %i.010.epil
				%tmp3 = load i8, i8* %arrayidx1.epil, align 1
				%conv2.epil = zext i8 %tmp3 to i32
				%mul.epil = mul nuw nsw i32 %conv2.epil, %conv.epil
				%arrayidx3.epil = getelementptr inbounds i32, i32* %C, i32 %i.010.epil
				store i32 %mul.epil, i32* %arrayidx3.epil, align 4
				%inc.epil = add nuw i32 %i.010.epil, 1
				%epil.iter.sub = add i32 %epil.iter, -1
				%epil.iter.cmp = icmp eq i32 %epil.iter.sub, 0
				br i1 %epil.iter.cmp, label %for.cond.cleanup, label %for.body.epil

				for.cond.cleanup: ; preds = %for.body.epil, %for.cond.cleanup.loopexit.unr-lcssa, %entry
				ret void

				for.body: ; preds = %for.body, %for.body.preheader.new
				%i.010 = phi i32 [ 0, %for.body.preheader.new ], [ %inc.3, %for.body ]
				%niter = phi i32 [ %unroll_iter, %for.body.preheader.new ], [ %niter.nsub.3, %for.body ]
				%arrayidx = getelementptr inbounds i8, i8* %A, i32 %i.010
				%tmp4 = load i8, i8* %arrayidx, align 1
				%conv = zext i8 %tmp4 to i32
				%arrayidx1 = getelementptr inbounds i8, i8* %B, i32 %i.010
				%tmp5 = load i8, i8* %arrayidx1, align 1
				%conv2 = zext i8 %tmp5 to i32
				%mul = mul nuw nsw i32 %conv2, %conv
				%arrayidx3 = getelementptr inbounds i32, i32* %C, i32 %i.010
				store i32 %mul, i32* %arrayidx3, align 4
				%inc = or i32 %i.010, 1
				%arrayidx.1 = getelementptr inbounds i8, i8* %A, i32 %inc
				%tmp6 = load i8, i8* %arrayidx.1, align 1
				%conv.1 = zext i8 %tmp6 to i32
				%arrayidx1.1 = getelementptr inbounds i8, i8* %B, i32 %inc
				%tmp7 = load i8, i8* %arrayidx1.1, align 1
				%conv2.1 = zext i8 %tmp7 to i32
				%mul.1 = mul nuw nsw i32 %conv2.1, %conv.1
				%arrayidx3.1 = getelementptr inbounds i32, i32* %C, i32 %inc
				store i32 %mul.1, i32* %arrayidx3.1, align 4
				%inc.1 = or i32 %i.010, 2
				%arrayidx.2 = getelementptr inbounds i8, i8* %A, i32 %inc.1
				%tmp8 = load i8, i8* %arrayidx.2, align 1
				%conv.2 = zext i8 %tmp8 to i32
				%arrayidx1.2 = getelementptr inbounds i8, i8* %B, i32 %inc.1
				%tmp9 = load i8, i8* %arrayidx1.2, align 1
				%conv2.2 = zext i8 %tmp9 to i32
				%mul.2 = mul nuw nsw i32 %conv2.2, %conv.2
				%arrayidx3.2 = getelementptr inbounds i32, i32* %C, i32 %inc.1
				store i32 %mul.2, i32* %arrayidx3.2, align 4
				%inc.2 = or i32 %i.010, 3
				%arrayidx.3 = getelementptr inbounds i8, i8* %A, i32 %inc.2
				%tmp10 = load i8, i8* %arrayidx.3, align 1
				%conv.3 = zext i8 %tmp10 to i32
				%arrayidx1.3 = getelementptr inbounds i8, i8* %B, i32 %inc.2
				%tmp11 = load i8, i8* %arrayidx1.3, align 1
				%conv2.3 = zext i8 %tmp11 to i32
				%mul.3 = mul nuw nsw i32 %conv2.3, %conv.3
				%arrayidx3.3 = getelementptr inbounds i32, i32* %C, i32 %inc.2
				store i32 %mul.3, i32* %arrayidx3.3, align 4
				%inc.3 = add i32 %i.010, 4
				%niter.nsub.3 = add i32 %niter, -4
				%niter.ncmp.3 = icmp eq i32 %niter.nsub.3, 0
				br i1 %niter.ncmp.3, label %for.cond.cleanup.loopexit.unr-lcssa, label %for.body
				}

				; CHECK-LABEL: mul_16x8
				; CHECK: @ %for.body

				; TODO: pre-inc store
				; CHECK-DEFAULT: ldrsh{{.*}}, #2]
				; CHECK-DEFAULT: ldrb{{.*}}, #1]
				; CHECK-DEFAULT: str{{.*}}, #4]
				; CHECK-DEFAULT: ldrsh{{.*}}, #4]
				; CHECK-DEFAULT: ldrb{{.*}}, #2]
				; CHECK-DEFAULT: str{{.*}}, #8]
				; CHECK-DEFAULT: ldrsh{{.*}}, #6]
				; CHECK-DEFAULT: ldrb{{.*}}, #3]
				; CHECK-DEFAULT: str{{.*}}, #12]
				; CHECK-DEFAULT: ldrsh{{.*}}, #8]!
				; CHECK-DEFAULT: ldrb{{.*}}, #4]!
				; CHECK-DEFAULT: str{{.*}}, #16]
				; CHECK-DEFAULT: add{{.*}}, #16

				; CHECK-COMPLEX: ldrsh
				; CHECK-COMPLEX: ldrb
				; CHECK-COMPLEX: str
				; CHECK-COMPLEX: ldrsh{{.*}}, #2]
				; CHECK-COMPLEX: ldrb{{.*}}, #1]
				; CHECK-COMPLEX: str{{.*}}, #4]
				; CHECK-COMPLEX: ldrsh{{.*}}, #4]
				; CHECK-COMPLEX: ldrb{{.*}}, #2]
				; CHECK-COMPLEX: str{{.*}}, #8]
				; CHECK-COMPLEX: ldrsh{{.*}}, #6]
				; CHECK-COMPLEX: ldrb{{.*}}, #3]
				; CHECK-COMPLEX: str{{.*}}, #12]

				; DISABLED-NOT: ldr{{.*}}]!
				; DISABLED-NOT: str{{.*}}]!

				; CHECK-T2: @ %for.body.epil
				; CHECK-T2: ldrsh{{.*}}, #2]!
				; CHECK-T2: ldrb{{.*}}, #1]!
				; CHECK-T2: str{{.*}}, #4]!

				define void @mul_16x8(i16* nocapture readonly %A, i8* nocapture readonly %B, i32* nocapture %C, i32 %N) {
				entry:
				%cmp9 = icmp eq i32 %N, 0
				br i1 %cmp9, label %for.cond.cleanup, label %for.body.preheader

				for.body.preheader: ; preds = %entry
				%tmp = add i32 %N, -1
				%xtraiter = and i32 %N, 3
				%tmp1 = icmp ult i32 %tmp, 3
				br i1 %tmp1, label %for.cond.cleanup.loopexit.unr-lcssa, label %for.body.preheader.new

				for.body.preheader.new: ; preds = %for.body.preheader
				%unroll_iter = sub i32 %N, %xtraiter
				br label %for.body

				for.cond.cleanup.loopexit.unr-lcssa: ; preds = %for.body, %for.body.preheader
				%i.010.unr = phi i32 [ 0, %for.body.preheader ], [ %inc.3, %for.body ]
				%lcmp.mod = icmp eq i32 %xtraiter, 0
				br i1 %lcmp.mod, label %for.cond.cleanup, label %for.body.epil

				for.body.epil: ; preds = %for.body.epil, %for.cond.cleanup.loopexit.unr-lcssa
				%i.010.epil = phi i32 [ %inc.epil, %for.body.epil ], [ %i.010.unr, %for.cond.cleanup.loopexit.unr-lcssa ]
				%epil.iter = phi i32 [ %epil.iter.sub, %for.body.epil ], [ %xtraiter, %for.cond.cleanup.loopexit.unr-lcssa ]
				%arrayidx.epil = getelementptr inbounds i16, i16* %A, i32 %i.010.epil
				%tmp2 = load i16, i16* %arrayidx.epil, align 2
				%conv.epil = sext i16 %tmp2 to i32
				%arrayidx1.epil = getelementptr inbounds i8, i8* %B, i32 %i.010.epil
				%tmp3 = load i8, i8* %arrayidx1.epil, align 1
				%conv2.epil = zext i8 %tmp3 to i32
				%mul.epil = mul nsw i32 %conv2.epil, %conv.epil
				%arrayidx3.epil = getelementptr inbounds i32, i32* %C, i32 %i.010.epil
				store i32 %mul.epil, i32* %arrayidx3.epil, align 4
				%inc.epil = add nuw i32 %i.010.epil, 1
				%epil.iter.sub = add i32 %epil.iter, -1
				%epil.iter.cmp = icmp eq i32 %epil.iter.sub, 0
				br i1 %epil.iter.cmp, label %for.cond.cleanup, label %for.body.epil

				for.cond.cleanup: ; preds = %for.body.epil, %for.cond.cleanup.loopexit.unr-lcssa, %entry
				ret void

				for.body: ; preds = %for.body, %for.body.preheader.new
				%i.010 = phi i32 [ 0, %for.body.preheader.new ], [ %inc.3, %for.body ]
				%niter = phi i32 [ %unroll_iter, %for.body.preheader.new ], [ %niter.nsub.3, %for.body ]
				%arrayidx = getelementptr inbounds i16, i16* %A, i32 %i.010
				%tmp4 = load i16, i16* %arrayidx, align 2
				%conv = sext i16 %tmp4 to i32
				%arrayidx1 = getelementptr inbounds i8, i8* %B, i32 %i.010
				%tmp5 = load i8, i8* %arrayidx1, align 1
				%conv2 = zext i8 %tmp5 to i32
				%mul = mul nsw i32 %conv2, %conv
				%arrayidx3 = getelementptr inbounds i32, i32* %C, i32 %i.010
				store i32 %mul, i32* %arrayidx3, align 4
				%inc = or i32 %i.010, 1
				%arrayidx.1 = getelementptr inbounds i16, i16* %A, i32 %inc
				%tmp6 = load i16, i16* %arrayidx.1, align 2
				%conv.1 = sext i16 %tmp6 to i32
				%arrayidx1.1 = getelementptr inbounds i8, i8* %B, i32 %inc
				%tmp7 = load i8, i8* %arrayidx1.1, align 1
				%conv2.1 = zext i8 %tmp7 to i32
				%mul.1 = mul nsw i32 %conv2.1, %conv.1
				%arrayidx3.1 = getelementptr inbounds i32, i32* %C, i32 %inc
				store i32 %mul.1, i32* %arrayidx3.1, align 4
				%inc.1 = or i32 %i.010, 2
				%arrayidx.2 = getelementptr inbounds i16, i16* %A, i32 %inc.1
				%tmp8 = load i16, i16* %arrayidx.2, align 2
				%conv.2 = sext i16 %tmp8 to i32
				%arrayidx1.2 = getelementptr inbounds i8, i8* %B, i32 %inc.1
				%tmp9 = load i8, i8* %arrayidx1.2, align 1
				%conv2.2 = zext i8 %tmp9 to i32
				%mul.2 = mul nsw i32 %conv2.2, %conv.2
				%arrayidx3.2 = getelementptr inbounds i32, i32* %C, i32 %inc.1
				store i32 %mul.2, i32* %arrayidx3.2, align 4
				%inc.2 = or i32 %i.010, 3
				%arrayidx.3 = getelementptr inbounds i16, i16* %A, i32 %inc.2
				%tmp10 = load i16, i16* %arrayidx.3, align 2
				%conv.3 = sext i16 %tmp10 to i32
				%arrayidx1.3 = getelementptr inbounds i8, i8* %B, i32 %inc.2
				%tmp11 = load i8, i8* %arrayidx1.3, align 1
				%conv2.3 = zext i8 %tmp11 to i32
				%mul.3 = mul nsw i32 %conv2.3, %conv.3
				%arrayidx3.3 = getelementptr inbounds i32, i32* %C, i32 %inc.2
				store i32 %mul.3, i32* %arrayidx3.3, align 4
				%inc.3 = add i32 %i.010, 4
				%niter.nsub.3 = add i32 %niter, -4
				%niter.ncmp.3 = icmp eq i32 %niter.nsub.3, 0
				br i1 %niter.ncmp.3, label %for.cond.cleanup.loopexit.unr-lcssa, label %for.body
				}

				; CHECK-LABEL: mul_16x16
				; CHECK: @ %for.body

				; TODO: pre-inc store
				; CHECK-DEFAULT: ldrsh{{.*}}, #2]
				; CHECK-DEFAULT: ldrsh{{.*}}, #2]
				; CHECK-DEFAULT: str{{.*}}, #4]
				; CHECK-DEFAULT: ldrsh{{.*}}, #4]
				; CHECK-DEFAULT: ldrsh{{.*}}, #4]
				; CHECK-DEFAULT: str{{.*}}, #8]
				; CHECK-DEFAULT: ldrsh{{.*}}, #6]
				; CHECK-DEFAULT: ldrsh{{.*}}, #6]
				; CHECK-DEFAULT: str{{.*}}, #12]
				; CHECK-DEFAULT: ldrsh{{.*}}, #8]!
				; CHECK-DEFAULT: ldrsh{{.*}}, #8]!
				; CHECK-DEFAULT: str{{.*}}, #16]
				; CHECK-DEFAULT: add{{.*}}, #16

				; CHECK-COMPLEX: ldrsh
				; CHECK-COMPLEX: ldrsh
				; CHECK-COMPLEX: str
				; CHECK-COMPLEX: ldrsh{{.*}}, #2]
				; CHECK-COMPLEX: ldrsh{{.*}}, #2]
				; CHECK-COMPLEX: str{{.*}}, #4]
				; CHECK-COMPLEX: ldrsh{{.*}}, #4]
				; CHECK-COMPLEX: ldrsh{{.*}}, #4]
				; CHECK-COMPLEX: str{{.*}}, #8]
				; CHECK-COMPLEX: ldrsh{{.*}}, #6]
				; CHECK-COMPLEX: ldrsh{{.*}}, #6]
				; CHECK-COMPLEX: str{{.*}}, #12]

				; DISABLED-NOT: ldr{{.*}}]!
				; DISABLED-NOT: str{{.*}}]!

				; CHECK-T2: @ %for.body.epil
				; CHECK-T2: ldrsh{{.*}}, #2]!
				; CHECK-T2: ldrsh{{.*}}, #2]!
				; CHECK-T2: str{{.*}}, #4]!

				define void @mul_16x16(i16* nocapture readonly %A, i16* nocapture readonly %B, i32* nocapture %C, i32 %N) {
				entry:
				%cmp9 = icmp eq i32 %N, 0
				br i1 %cmp9, label %for.cond.cleanup, label %for.body.preheader

				for.body.preheader: ; preds = %entry
				%tmp = add i32 %N, -1
				%xtraiter = and i32 %N, 3
				%tmp1 = icmp ult i32 %tmp, 3
				br i1 %tmp1, label %for.cond.cleanup.loopexit.unr-lcssa, label %for.body.preheader.new

				for.body.preheader.new: ; preds = %for.body.preheader
				%unroll_iter = sub i32 %N, %xtraiter
				br label %for.body

				for.cond.cleanup.loopexit.unr-lcssa: ; preds = %for.body, %for.body.preheader
				%i.010.unr = phi i32 [ 0, %for.body.preheader ], [ %inc.3, %for.body ]
				%lcmp.mod = icmp eq i32 %xtraiter, 0
				br i1 %lcmp.mod, label %for.cond.cleanup, label %for.body.epil

				for.body.epil: ; preds = %for.body.epil, %for.cond.cleanup.loopexit.unr-lcssa
				%i.010.epil = phi i32 [ %inc.epil, %for.body.epil ], [ %i.010.unr, %for.cond.cleanup.loopexit.unr-lcssa ]
				%epil.iter = phi i32 [ %epil.iter.sub, %for.body.epil ], [ %xtraiter, %for.cond.cleanup.loopexit.unr-lcssa ]
				%arrayidx.epil = getelementptr inbounds i16, i16* %A, i32 %i.010.epil
				%tmp2 = load i16, i16* %arrayidx.epil, align 2
				%conv.epil = sext i16 %tmp2 to i32
				%arrayidx1.epil = getelementptr inbounds i16, i16* %B, i32 %i.010.epil
				%tmp3 = load i16, i16* %arrayidx1.epil, align 2
				%conv2.epil = sext i16 %tmp3 to i32
				%mul.epil = mul nsw i32 %conv2.epil, %conv.epil
				%arrayidx3.epil = getelementptr inbounds i32, i32* %C, i32 %i.010.epil
				store i32 %mul.epil, i32* %arrayidx3.epil, align 4
				%inc.epil = add nuw i32 %i.010.epil, 1
				%epil.iter.sub = add i32 %epil.iter, -1
				%epil.iter.cmp = icmp eq i32 %epil.iter.sub, 0
				br i1 %epil.iter.cmp, label %for.cond.cleanup, label %for.body.epil

				for.cond.cleanup: ; preds = %for.body.epil, %for.cond.cleanup.loopexit.unr-lcssa, %entry
				ret void

				for.body: ; preds = %for.body, %for.body.preheader.new
				%i.010 = phi i32 [ 0, %for.body.preheader.new ], [ %inc.3, %for.body ]
				%niter = phi i32 [ %unroll_iter, %for.body.preheader.new ], [ %niter.nsub.3, %for.body ]
				%arrayidx = getelementptr inbounds i16, i16* %A, i32 %i.010
				%tmp4 = load i16, i16* %arrayidx, align 2
				%conv = sext i16 %tmp4 to i32
				%arrayidx1 = getelementptr inbounds i16, i16* %B, i32 %i.010
				%tmp5 = load i16, i16* %arrayidx1, align 2
				%conv2 = sext i16 %tmp5 to i32
				%mul = mul nsw i32 %conv2, %conv
				%arrayidx3 = getelementptr inbounds i32, i32* %C, i32 %i.010
				store i32 %mul, i32* %arrayidx3, align 4
				%inc = or i32 %i.010, 1
				%arrayidx.1 = getelementptr inbounds i16, i16* %A, i32 %inc
				%tmp6 = load i16, i16* %arrayidx.1, align 2
				%conv.1 = sext i16 %tmp6 to i32
				%arrayidx1.1 = getelementptr inbounds i16, i16* %B, i32 %inc
				%tmp7 = load i16, i16* %arrayidx1.1, align 2
				%conv2.1 = sext i16 %tmp7 to i32
				%mul.1 = mul nsw i32 %conv2.1, %conv.1
				%arrayidx3.1 = getelementptr inbounds i32, i32* %C, i32 %inc
				store i32 %mul.1, i32* %arrayidx3.1, align 4
				%inc.1 = or i32 %i.010, 2
				%arrayidx.2 = getelementptr inbounds i16, i16* %A, i32 %inc.1
				%tmp8 = load i16, i16* %arrayidx.2, align 2
				%conv.2 = sext i16 %tmp8 to i32
				%arrayidx1.2 = getelementptr inbounds i16, i16* %B, i32 %inc.1
				%tmp9 = load i16, i16* %arrayidx1.2, align 2
				%conv2.2 = sext i16 %tmp9 to i32
				%mul.2 = mul nsw i32 %conv2.2, %conv.2
				%arrayidx3.2 = getelementptr inbounds i32, i32* %C, i32 %inc.1
				store i32 %mul.2, i32* %arrayidx3.2, align 4
				%inc.2 = or i32 %i.010, 3
				%arrayidx.3 = getelementptr inbounds i16, i16* %A, i32 %inc.2
				%tmp10 = load i16, i16* %arrayidx.3, align 2
				%conv.3 = sext i16 %tmp10 to i32
				%arrayidx1.3 = getelementptr inbounds i16, i16* %B, i32 %inc.2
				%tmp11 = load i16, i16* %arrayidx1.3, align 2
				%conv2.3 = sext i16 %tmp11 to i32
				%mul.3 = mul nsw i32 %conv2.3, %conv.3
				%arrayidx3.3 = getelementptr inbounds i32, i32* %C, i32 %inc.2
				store i32 %mul.3, i32* %arrayidx3.3, align 4
				%inc.3 = add i32 %i.010, 4
				%niter.nsub.3 = add i32 %niter, -4
				%niter.ncmp.3 = icmp eq i32 %niter.nsub.3, 0
				br i1 %niter.ncmp.3, label %for.cond.cleanup.loopexit.unr-lcssa, label %for.body
				}

				; CHECK-LABEL: mul_8x8_2d
				; CHECK: @ %for.body4.us

				; CHECK-DEFAULT: ldr{{.*}}, #16]!
				; CHECK-DEFAULT: ldrb{{.*}}
				; CHECK-DEFAULT: ldrb{{.*}}, #1]
				; CHECK-DEFAULT: str{{.*}}, #-12]
				; CHECK-DEFAULT: ldrb{{.*}}
				; CHECK-DEFAULT: ldrb{{.*}}, #2]
				; CHECK-DEFAULT: str{{.*}}, #-8]
				; CHECK-DEFAULT: ldrb{{.*}}
				; CHECK-DEFAULT: ldrb{{.*}}, #3]
				; CHECK-DEFAULT: str{{.*}}, #-4]
				; CHECK-DEFAULT: ldrb{{.*}}
				; CHECK-DEFAULT: ldrb{{.*}}, #4]!

				; DISABLED-NOT: ldr{{.*}}]!
				; DISABLED-NOT: str{{.*}}]!

				; CHECK-T2: @ %for.body4.us.epil
				; CHECK-T2: ldrb{{.*}}, #1]!
				; CHECK-T2: ldr{{.*}}, #4]!

				define void @mul_8x8_2d(i8* nocapture readonly %A, i8** nocapture readonly %B, i32** nocapture readonly %C, i32 %N, i32 %M) {
				entry:
				%cmp24 = icmp eq i32 %N, 0
				%cmp222 = icmp eq i32 %M, 0
				%or.cond = or i1 %cmp24, %cmp222
				br i1 %or.cond, label %for.cond.cleanup, label %for.cond1.preheader.us.preheader

				for.cond1.preheader.us.preheader: ; preds = %entry
				%tmp = add i32 %M, -1
				%xtraiter = and i32 %M, 3
				%tmp1 = icmp ult i32 %tmp, 3
				%unroll_iter = sub i32 %M, %xtraiter
				%lcmp.mod = icmp eq i32 %xtraiter, 0
				br label %for.cond1.preheader.us

				for.cond1.preheader.us: ; preds = %for.cond1.for.cond.cleanup3_crit_edge.us, %for.cond1.preheader.us.preheader
				%i.025.us = phi i32 [ %inc11.us, %for.cond1.for.cond.cleanup3_crit_edge.us ], [ 0, %for.cond1.preheader.us.preheader ]
				%arrayidx.us = getelementptr inbounds i8, i8* %A, i32 %i.025.us
				%arrayidx5.us = getelementptr inbounds i8*, i8** %B, i32 %i.025.us
				%arrayidx8.us = getelementptr inbounds i32*, i32** %C, i32 %i.025.us
				%.pre = load i8*, i8** %arrayidx5.us, align 4
				%.pre30 = load i32*, i32** %arrayidx8.us, align 4
				br i1 %tmp1, label %for.cond1.for.cond.cleanup3_crit_edge.us.unr-lcssa, label %for.body4.us

				for.body4.us: ; preds = %for.body4.us, %for.cond1.preheader.us
				%j.023.us = phi i32 [ %inc.us.3, %for.body4.us ], [ 0, %for.cond1.preheader.us ]
				%niter = phi i32 [ %niter.nsub.3, %for.body4.us ], [ %unroll_iter, %for.cond1.preheader.us ]
				%tmp2 = load i8, i8* %arrayidx.us, align 1
				%conv.us = zext i8 %tmp2 to i32
				%arrayidx6.us = getelementptr inbounds i8, i8* %.pre, i32 %j.023.us
				%tmp3 = load i8, i8* %arrayidx6.us, align 1
				%conv7.us = zext i8 %tmp3 to i32
				%mul.us = mul nuw nsw i32 %conv7.us, %conv.us
				%arrayidx9.us = getelementptr inbounds i32, i32* %.pre30, i32 %j.023.us
				%tmp4 = load i32, i32* %arrayidx9.us, align 4
				%add.us = add nsw i32 %tmp4, %mul.us
				store i32 %add.us, i32* %arrayidx9.us, align 4
				%inc.us = or i32 %j.023.us, 1
				%tmp5 = load i8, i8* %arrayidx.us, align 1
				%conv.us.1 = zext i8 %tmp5 to i32
				%arrayidx6.us.1 = getelementptr inbounds i8, i8* %.pre, i32 %inc.us
				%tmp6 = load i8, i8* %arrayidx6.us.1, align 1
				%conv7.us.1 = zext i8 %tmp6 to i32
				%mul.us.1 = mul nuw nsw i32 %conv7.us.1, %conv.us.1
				%arrayidx9.us.1 = getelementptr inbounds i32, i32* %.pre30, i32 %inc.us
				%tmp7 = load i32, i32* %arrayidx9.us.1, align 4
				%add.us.1 = add nsw i32 %tmp7, %mul.us.1
				store i32 %add.us.1, i32* %arrayidx9.us.1, align 4
				%inc.us.1 = or i32 %j.023.us, 2
				%tmp8 = load i8, i8* %arrayidx.us, align 1
				%conv.us.2 = zext i8 %tmp8 to i32
				%arrayidx6.us.2 = getelementptr inbounds i8, i8* %.pre, i32 %inc.us.1
				%tmp9 = load i8, i8* %arrayidx6.us.2, align 1
				%conv7.us.2 = zext i8 %tmp9 to i32
				%mul.us.2 = mul nuw nsw i32 %conv7.us.2, %conv.us.2
				%arrayidx9.us.2 = getelementptr inbounds i32, i32* %.pre30, i32 %inc.us.1
				%tmp10 = load i32, i32* %arrayidx9.us.2, align 4
				%add.us.2 = add nsw i32 %tmp10, %mul.us.2
				store i32 %add.us.2, i32* %arrayidx9.us.2, align 4
				%inc.us.2 = or i32 %j.023.us, 3
				%tmp11 = load i8, i8* %arrayidx.us, align 1
				%conv.us.3 = zext i8 %tmp11 to i32
				%arrayidx6.us.3 = getelementptr inbounds i8, i8* %.pre, i32 %inc.us.2
				%tmp12 = load i8, i8* %arrayidx6.us.3, align 1
				%conv7.us.3 = zext i8 %tmp12 to i32
				%mul.us.3 = mul nuw nsw i32 %conv7.us.3, %conv.us.3
				%arrayidx9.us.3 = getelementptr inbounds i32, i32* %.pre30, i32 %inc.us.2
				%tmp13 = load i32, i32* %arrayidx9.us.3, align 4
				%add.us.3 = add nsw i32 %tmp13, %mul.us.3
				store i32 %add.us.3, i32* %arrayidx9.us.3, align 4
				%inc.us.3 = add i32 %j.023.us, 4
				%niter.nsub.3 = add i32 %niter, -4
				%niter.ncmp.3 = icmp eq i32 %niter.nsub.3, 0
				br i1 %niter.ncmp.3, label %for.cond1.for.cond.cleanup3_crit_edge.us.unr-lcssa, label %for.body4.us

				for.cond1.for.cond.cleanup3_crit_edge.us.unr-lcssa: ; preds = %for.body4.us, %for.cond1.preheader.us
				%j.023.us.unr = phi i32 [ 0, %for.cond1.preheader.us ], [ %inc.us.3, %for.body4.us ]
				br i1 %lcmp.mod, label %for.cond1.for.cond.cleanup3_crit_edge.us, label %for.body4.us.epil

				for.body4.us.epil: ; preds = %for.body4.us.epil, %for.cond1.for.cond.cleanup3_crit_edge.us.unr-lcssa
				%j.023.us.epil = phi i32 [ %inc.us.epil, %for.body4.us.epil ], [ %j.023.us.unr, %for.cond1.for.cond.cleanup3_crit_edge.us.unr-lcssa ]
				%epil.iter = phi i32 [ %epil.iter.sub, %for.body4.us.epil ], [ %xtraiter, %for.cond1.for.cond.cleanup3_crit_edge.us.unr-lcssa ]
				%tmp14 = load i8, i8* %arrayidx.us, align 1
				%conv.us.epil = zext i8 %tmp14 to i32
				%arrayidx6.us.epil = getelementptr inbounds i8, i8* %.pre, i32 %j.023.us.epil
				%tmp15 = load i8, i8* %arrayidx6.us.epil, align 1
				%conv7.us.epil = zext i8 %tmp15 to i32
				%mul.us.epil = mul nuw nsw i32 %conv7.us.epil, %conv.us.epil
				%arrayidx9.us.epil = getelementptr inbounds i32, i32* %.pre30, i32 %j.023.us.epil
				%tmp16 = load i32, i32* %arrayidx9.us.epil, align 4
				%add.us.epil = add nsw i32 %tmp16, %mul.us.epil
				store i32 %add.us.epil, i32* %arrayidx9.us.epil, align 4
				%inc.us.epil = add nuw i32 %j.023.us.epil, 1
				%epil.iter.sub = add i32 %epil.iter, -1
				%epil.iter.cmp = icmp eq i32 %epil.iter.sub, 0
				br i1 %epil.iter.cmp, label %for.cond1.for.cond.cleanup3_crit_edge.us, label %for.body4.us.epil

				for.cond1.for.cond.cleanup3_crit_edge.us: ; preds = %for.body4.us.epil, %for.cond1.for.cond.cleanup3_crit_edge.us.unr-lcssa
				%inc11.us = add nuw i32 %i.025.us, 1
				%exitcond28 = icmp eq i32 %inc11.us, %N
				br i1 %exitcond28, label %for.cond.cleanup, label %for.cond1.preheader.us

				for.cond.cleanup: ; preds = %for.cond1.for.cond.cleanup3_crit_edge.us, %entry
				ret void
				}

				; CHECK-LABEL: mul_16x16_2d
				; CHECK: @ %for.body4.us

				; CHECK-DEFAULT: ldr{{.*}}, #16]!
				; CHECK-DEFAULT: ldrsh{{.*}}, #8]!

				; DISABLED-NOT: ldr{{.*}}]!
				; DISABLED-NOT: str{{.*}}]!

				; CHECK-T2: @ %for.body4.us.epil
				; CHECK-T2: ldrsh{{.*}}, #2]!
				; CHECK-T2: ldr{{.*}}, #4]!

				define void @mul_16x16_2d(i16* nocapture readonly %A, i16** nocapture readonly %B, i32** nocapture readonly %C, i32 %N, i32 %M) {
				entry:
				%cmp24 = icmp eq i32 %N, 0
				%cmp222 = icmp eq i32 %M, 0
				%or.cond = or i1 %cmp24, %cmp222
				br i1 %or.cond, label %for.cond.cleanup, label %for.cond1.preheader.us.preheader

				for.cond1.preheader.us.preheader: ; preds = %entry
				%tmp = add i32 %M, -1
				%xtraiter = and i32 %M, 3
				%tmp1 = icmp ult i32 %tmp, 3
				%unroll_iter = sub i32 %M, %xtraiter
				%lcmp.mod = icmp eq i32 %xtraiter, 0
				br label %for.cond1.preheader.us

				for.cond1.preheader.us: ; preds = %for.cond1.for.cond.cleanup3_crit_edge.us, %for.cond1.preheader.us.preheader
				%i.025.us = phi i32 [ %inc11.us, %for.cond1.for.cond.cleanup3_crit_edge.us ], [ 0, %for.cond1.preheader.us.preheader ]
				%arrayidx.us = getelementptr inbounds i16, i16* %A, i32 %i.025.us
				%tmp2 = load i16, i16* %arrayidx.us, align 2
				%conv.us = sext i16 %tmp2 to i32
				%arrayidx5.us = getelementptr inbounds i16*, i16** %B, i32 %i.025.us
				%tmp3 = load i16*, i16** %arrayidx5.us, align 4
				%arrayidx8.us = getelementptr inbounds i32*, i32** %C, i32 %i.025.us
				%tmp4 = load i32*, i32** %arrayidx8.us, align 4
				br i1 %tmp1, label %for.cond1.for.cond.cleanup3_crit_edge.us.unr-lcssa, label %for.body4.us

				for.body4.us: ; preds = %for.body4.us, %for.cond1.preheader.us
				%j.023.us = phi i32 [ %inc.us.3, %for.body4.us ], [ 0, %for.cond1.preheader.us ]
				%niter = phi i32 [ %niter.nsub.3, %for.body4.us ], [ %unroll_iter, %for.cond1.preheader.us ]
				%arrayidx6.us = getelementptr inbounds i16, i16* %tmp3, i32 %j.023.us
				%tmp5 = load i16, i16* %arrayidx6.us, align 2
				%conv7.us = sext i16 %tmp5 to i32
				%mul.us = mul nsw i32 %conv7.us, %conv.us
				%arrayidx9.us = getelementptr inbounds i32, i32* %tmp4, i32 %j.023.us
				%tmp6 = load i32, i32* %arrayidx9.us, align 4
				%add.us = add nsw i32 %tmp6, %mul.us
				store i32 %add.us, i32* %arrayidx9.us, align 4
				%inc.us = or i32 %j.023.us, 1
				%arrayidx6.us.1 = getelementptr inbounds i16, i16* %tmp3, i32 %inc.us
				%tmp7 = load i16, i16* %arrayidx6.us.1, align 2
				%conv7.us.1 = sext i16 %tmp7 to i32
				%mul.us.1 = mul nsw i32 %conv7.us.1, %conv.us
				%arrayidx9.us.1 = getelementptr inbounds i32, i32* %tmp4, i32 %inc.us
				%tmp8 = load i32, i32* %arrayidx9.us.1, align 4
				%add.us.1 = add nsw i32 %tmp8, %mul.us.1
				store i32 %add.us.1, i32* %arrayidx9.us.1, align 4
				%inc.us.1 = or i32 %j.023.us, 2
				%arrayidx6.us.2 = getelementptr inbounds i16, i16* %tmp3, i32 %inc.us.1
				%tmp9 = load i16, i16* %arrayidx6.us.2, align 2
				%conv7.us.2 = sext i16 %tmp9 to i32
				%mul.us.2 = mul nsw i32 %conv7.us.2, %conv.us
				%arrayidx9.us.2 = getelementptr inbounds i32, i32* %tmp4, i32 %inc.us.1
				%tmp10 = load i32, i32* %arrayidx9.us.2, align 4
				%add.us.2 = add nsw i32 %tmp10, %mul.us.2
				store i32 %add.us.2, i32* %arrayidx9.us.2, align 4
				%inc.us.2 = or i32 %j.023.us, 3
				%arrayidx6.us.3 = getelementptr inbounds i16, i16* %tmp3, i32 %inc.us.2
				%tmp11 = load i16, i16* %arrayidx6.us.3, align 2
				%conv7.us.3 = sext i16 %tmp11 to i32
				%mul.us.3 = mul nsw i32 %conv7.us.3, %conv.us
				%arrayidx9.us.3 = getelementptr inbounds i32, i32* %tmp4, i32 %inc.us.2
				%tmp12 = load i32, i32* %arrayidx9.us.3, align 4
				%add.us.3 = add nsw i32 %tmp12, %mul.us.3
				store i32 %add.us.3, i32* %arrayidx9.us.3, align 4
				%inc.us.3 = add i32 %j.023.us, 4
				%niter.nsub.3 = add i32 %niter, -4
				%niter.ncmp.3 = icmp eq i32 %niter.nsub.3, 0
				br i1 %niter.ncmp.3, label %for.cond1.for.cond.cleanup3_crit_edge.us.unr-lcssa, label %for.body4.us

				for.cond1.for.cond.cleanup3_crit_edge.us.unr-lcssa: ; preds = %for.body4.us, %for.cond1.preheader.us
				%j.023.us.unr = phi i32 [ 0, %for.cond1.preheader.us ], [ %inc.us.3, %for.body4.us ]
				br i1 %lcmp.mod, label %for.cond1.for.cond.cleanup3_crit_edge.us, label %for.body4.us.epil

				for.body4.us.epil: ; preds = %for.body4.us.epil, %for.cond1.for.cond.cleanup3_crit_edge.us.unr-lcssa
				%j.023.us.epil = phi i32 [ %inc.us.epil, %for.body4.us.epil ], [ %j.023.us.unr, %for.cond1.for.cond.cleanup3_crit_edge.us.unr-lcssa ]
				%epil.iter = phi i32 [ %epil.iter.sub, %for.body4.us.epil ], [ %xtraiter, %for.cond1.for.cond.cleanup3_crit_edge.us.unr-lcssa ]
				%arrayidx6.us.epil = getelementptr inbounds i16, i16* %tmp3, i32 %j.023.us.epil
				%tmp13 = load i16, i16* %arrayidx6.us.epil, align 2
				%conv7.us.epil = sext i16 %tmp13 to i32
				%mul.us.epil = mul nsw i32 %conv7.us.epil, %conv.us
				%arrayidx9.us.epil = getelementptr inbounds i32, i32* %tmp4, i32 %j.023.us.epil
				%tmp14 = load i32, i32* %arrayidx9.us.epil, align 4
				%add.us.epil = add nsw i32 %tmp14, %mul.us.epil
				store i32 %add.us.epil, i32* %arrayidx9.us.epil, align 4
				%inc.us.epil = add nuw i32 %j.023.us.epil, 1
				%epil.iter.sub = add i32 %epil.iter, -1
				%epil.iter.cmp = icmp eq i32 %epil.iter.sub, 0
				br i1 %epil.iter.cmp, label %for.cond1.for.cond.cleanup3_crit_edge.us, label %for.body4.us.epil

				for.cond1.for.cond.cleanup3_crit_edge.us: ; preds = %for.body4.us.epil, %for.cond1.for.cond.cleanup3_crit_edge.us.unr-lcssa
				%inc11.us = add nuw i32 %i.025.us, 1
				%exitcond28 = icmp eq i32 %inc11.us, %N
				br i1 %exitcond28, label %for.cond.cleanup, label %for.cond1.preheader.us

				for.cond.cleanup: ; preds = %for.cond1.for.cond.cleanup3_crit_edge.us, %entry
				ret void
				}

				; CHECK-LABEL: mac_8x8_2d
				; CHECK: @ %for.body4.us

				; CHECK-BASE: ldrb{{.*}}
				; CHECK-BASE: ldrb{{.*}}
				; CHECK-BASE: str{{.*}}, lsl #2]
				; CHECK-BASE: ldrb{{.*}}
				; CHECK-BASE: ldrb{{.*}}, #1]
				; CHECK-BASE: str{{.*}}, lsl #2]
				; CHECK-BASE: ldrb{{.*}}
				; CHECK-BASE: ldrb{{.*}}, #2]
				; CHECK-BASE: str{{.*}}, lsl #2]
				; CHECK-BASE: ldrb{{.*}}
				; CHECK-BASE: ldrb{{.*}}, #3]
				; CHECK-BASE: str{{.*}}, lsl #2]

				; CHECK-COMPLEX: ldrb{{.*}}
				; CHECK-COMPLEX: ldrb{{.*}}
				; CHECK-COMPLEX: str{{.*}}, lsl #2]
				; CHECK-COMPLEX: ldrb{{.*}}
				; CHECK-COMPLEX: ldrb{{.*}}, #1]
				; CHECK-COMPLEX: str{{.*}}, lsl #2]
				; CHECK-COMPLEX: ldrb{{.*}}
				; CHECK-COMPLEX: ldrb{{.*}}, #2]
				; CHECK-COMPLEX: str{{.*}}, lsl #2]
				; CHECK-COMPLEX: ldrb{{.*}}
				; CHECK-COMPLEX: ldrb{{.*}}, #3]
				; CHECK-COMPLEX: str{{.*}}, lsl #2]

				; CHECK-COLLAPSE: ldrb{{.*}},
				; CHECK-COLLAPSE: ldrb{{.*}}, #1]
				; CHECK-COLLAPSE: str{{.*}}, lsl #2]
				; CHECK-COLLAPSE: ldrb{{.*}}, #2]
				; CHECK-COLLAPSE: str{{.*}}, lsl #2]
				; CHECK-COLLAPSE: ldrb{{.*}}, #3]
				; CHECK-COLLAPSE: str{{.*}}, lsl #2]
				; CHECK-COLLAPSE: ldrb{{.*}}, #4]!

				; DISABLED-NOT: ldr{{.*}}]!
				; DISABLED-NOT: str{{.*}}]!

				; CHECK-T2: @ %for.body4.us.epil
				; CHECK-T2: ldrb{{.*}}, #1]!

				define void @mac_8x8_2d(i8* nocapture readonly %A, i8** nocapture readonly %B, i32* nocapture %C, i32 %N, i32 %M) {
				entry:
				%cmp22 = icmp eq i32 %N, 0
				%cmp220 = icmp eq i32 %M, 0
				%or.cond = or i1 %cmp22, %cmp220
				br i1 %or.cond, label %for.cond.cleanup, label %for.cond1.preheader.us.preheader

				for.cond1.preheader.us.preheader: ; preds = %entry
				%tmp = add i32 %M, -1
				%xtraiter = and i32 %M, 3
				%tmp1 = icmp ult i32 %tmp, 3
				%unroll_iter = sub i32 %M, %xtraiter
				%lcmp.mod = icmp eq i32 %xtraiter, 0
				br label %for.cond1.preheader.us

				for.cond1.preheader.us: ; preds = %for.cond1.for.cond.cleanup3_crit_edge.us, %for.cond1.preheader.us.preheader
				%i.023.us = phi i32 [ %inc10.us, %for.cond1.for.cond.cleanup3_crit_edge.us ], [ 0, %for.cond1.preheader.us.preheader ]
				%arrayidx.us = getelementptr inbounds i8, i8* %A, i32 %i.023.us
				%arrayidx5.us = getelementptr inbounds i8*, i8** %B, i32 %i.023.us
				%arrayidx8.us = getelementptr inbounds i32, i32* %C, i32 %i.023.us
				%.pre = load i8*, i8** %arrayidx5.us, align 4
				%.pre28 = load i32, i32* %arrayidx8.us, align 4
				br i1 %tmp1, label %for.cond1.for.cond.cleanup3_crit_edge.us.unr-lcssa, label %for.body4.us

				for.body4.us: ; preds = %for.body4.us, %for.cond1.preheader.us
				%tmp2 = phi i32 [ %add.us.3, %for.body4.us ], [ %.pre28, %for.cond1.preheader.us ]
				%j.021.us = phi i32 [ %inc.us.3, %for.body4.us ], [ 0, %for.cond1.preheader.us ]
				%niter = phi i32 [ %niter.nsub.3, %for.body4.us ], [ %unroll_iter, %for.cond1.preheader.us ]
				%tmp3 = load i8, i8* %arrayidx.us, align 1
				%conv.us = zext i8 %tmp3 to i32
				%arrayidx6.us = getelementptr inbounds i8, i8* %.pre, i32 %j.021.us
				%tmp4 = load i8, i8* %arrayidx6.us, align 1
				%conv7.us = zext i8 %tmp4 to i32
				%mul.us = mul nuw nsw i32 %conv7.us, %conv.us
				%add.us = add nsw i32 %mul.us, %tmp2
				store i32 %add.us, i32* %arrayidx8.us, align 4
				%inc.us = or i32 %j.021.us, 1
				%tmp5 = load i8, i8* %arrayidx.us, align 1
				%conv.us.1 = zext i8 %tmp5 to i32
				%arrayidx6.us.1 = getelementptr inbounds i8, i8* %.pre, i32 %inc.us
				%tmp6 = load i8, i8* %arrayidx6.us.1, align 1
				%conv7.us.1 = zext i8 %tmp6 to i32
				%mul.us.1 = mul nuw nsw i32 %conv7.us.1, %conv.us.1
				%add.us.1 = add nsw i32 %mul.us.1, %add.us
				store i32 %add.us.1, i32* %arrayidx8.us, align 4
				%inc.us.1 = or i32 %j.021.us, 2
				%tmp7 = load i8, i8* %arrayidx.us, align 1
				%conv.us.2 = zext i8 %tmp7 to i32
				%arrayidx6.us.2 = getelementptr inbounds i8, i8* %.pre, i32 %inc.us.1
				%tmp8 = load i8, i8* %arrayidx6.us.2, align 1
				%conv7.us.2 = zext i8 %tmp8 to i32
				%mul.us.2 = mul nuw nsw i32 %conv7.us.2, %conv.us.2
				%add.us.2 = add nsw i32 %mul.us.2, %add.us.1
				store i32 %add.us.2, i32* %arrayidx8.us, align 4
				%inc.us.2 = or i32 %j.021.us, 3
				%tmp9 = load i8, i8* %arrayidx.us, align 1
				%conv.us.3 = zext i8 %tmp9 to i32
				%arrayidx6.us.3 = getelementptr inbounds i8, i8* %.pre, i32 %inc.us.2
				%tmp10 = load i8, i8* %arrayidx6.us.3, align 1
				%conv7.us.3 = zext i8 %tmp10 to i32
				%mul.us.3 = mul nuw nsw i32 %conv7.us.3, %conv.us.3
				%add.us.3 = add nsw i32 %mul.us.3, %add.us.2
				store i32 %add.us.3, i32* %arrayidx8.us, align 4
				%inc.us.3 = add i32 %j.021.us, 4
				%niter.nsub.3 = add i32 %niter, -4
				%niter.ncmp.3 = icmp eq i32 %niter.nsub.3, 0
				br i1 %niter.ncmp.3, label %for.cond1.for.cond.cleanup3_crit_edge.us.unr-lcssa, label %for.body4.us

				for.cond1.for.cond.cleanup3_crit_edge.us.unr-lcssa: ; preds = %for.body4.us, %for.cond1.preheader.us
				%.unr = phi i32 [ %.pre28, %for.cond1.preheader.us ], [ %add.us.3, %for.body4.us ]
				%j.021.us.unr = phi i32 [ 0, %for.cond1.preheader.us ], [ %inc.us.3, %for.body4.us ]
				br i1 %lcmp.mod, label %for.cond1.for.cond.cleanup3_crit_edge.us, label %for.body4.us.epil

				for.body4.us.epil: ; preds = %for.body4.us.epil, %for.cond1.for.cond.cleanup3_crit_edge.us.unr-lcssa
				%tmp11 = phi i32 [ %add.us.epil, %for.body4.us.epil ], [ %.unr, %for.cond1.for.cond.cleanup3_crit_edge.us.unr-lcssa ]
				%j.021.us.epil = phi i32 [ %inc.us.epil, %for.body4.us.epil ], [ %j.021.us.unr, %for.cond1.for.cond.cleanup3_crit_edge.us.unr-lcssa ]
				%epil.iter = phi i32 [ %epil.iter.sub, %for.body4.us.epil ], [ %xtraiter, %for.cond1.for.cond.cleanup3_crit_edge.us.unr-lcssa ]
				%tmp12 = load i8, i8* %arrayidx.us, align 1
				%conv.us.epil = zext i8 %tmp12 to i32
				%arrayidx6.us.epil = getelementptr inbounds i8, i8* %.pre, i32 %j.021.us.epil
				%tmp13 = load i8, i8* %arrayidx6.us.epil, align 1
				%conv7.us.epil = zext i8 %tmp13 to i32
				%mul.us.epil = mul nuw nsw i32 %conv7.us.epil, %conv.us.epil
				%add.us.epil = add nsw i32 %mul.us.epil, %tmp11
				store i32 %add.us.epil, i32* %arrayidx8.us, align 4
				%inc.us.epil = add nuw i32 %j.021.us.epil, 1
				%epil.iter.sub = add i32 %epil.iter, -1
				%epil.iter.cmp = icmp eq i32 %epil.iter.sub, 0
				br i1 %epil.iter.cmp, label %for.cond1.for.cond.cleanup3_crit_edge.us, label %for.body4.us.epil

				for.cond1.for.cond.cleanup3_crit_edge.us: ; preds = %for.body4.us.epil, %for.cond1.for.cond.cleanup3_crit_edge.us.unr-lcssa
				%inc10.us = add nuw i32 %i.023.us, 1
				%exitcond26 = icmp eq i32 %inc10.us, %N
				br i1 %exitcond26, label %for.cond.cleanup, label %for.cond1.preheader.us

				for.cond.cleanup: ; preds = %for.cond1.for.cond.cleanup3_crit_edge.us, %entry
				ret void
				}

				; CHECK-LABEL: mac_16x16_2d
				; CHECK: @ %for.body4.us

				; CHECK-BASE: ldrsh{{.*}}, lsl #1]
				; CHECK-BASE: ldrsh{{.*}}, #2]
				; CHECK-BASE: ldrsh{{.*}}, #4]
				; CHECK-BASE: ldrsh{{.*}}, #6]

				; CHECK-COMPLEX: ldrsh{{.*}}, lsl #1]
				; CHECK-COMPLEX: ldrsh{{.*}}, #2]
				; CHECK-COMPLEX: ldrsh{{.*}}, #4]
				; CHECK-COMPLEX: ldrsh{{.*}}, #6]

				; CHECK-COLLAPSE: ldrsh{{.*}}, #8]!

				; DISABLED-NOT: ldr{{.*}}]!

				; CHECK-T2: @ %for.body4.us.epil
				; CHECK-T2: ldrsh{{.*}}, #2]!

				define void @mac_16x16_2d(i16* nocapture readonly %A, i16** nocapture readonly %B, i32* nocapture %C, i32 %N, i32 %M) {
				entry:
				%cmp23 = icmp eq i32 %N, 0
				%cmp220 = icmp eq i32 %M, 0
				%or.cond = or i1 %cmp23, %cmp220
				br i1 %or.cond, label %for.cond.cleanup, label %for.cond1.preheader.us.preheader

				for.cond1.preheader.us.preheader: ; preds = %entry
				%tmp = add i32 %M, -1
				%xtraiter = and i32 %M, 3
				%tmp1 = icmp ult i32 %tmp, 3
				%unroll_iter = sub i32 %M, %xtraiter
				%lcmp.mod = icmp eq i32 %xtraiter, 0
				br label %for.cond1.preheader.us

				for.cond1.preheader.us: ; preds = %for.cond1.for.cond.cleanup3_crit_edge.us, %for.cond1.preheader.us.preheader
				%i.024.us = phi i32 [ %inc10.us, %for.cond1.for.cond.cleanup3_crit_edge.us ], [ 0, %for.cond1.preheader.us.preheader ]
				%arrayidx.us = getelementptr inbounds i16, i16* %A, i32 %i.024.us
				%tmp2 = load i16, i16* %arrayidx.us, align 2
				%conv.us = sext i16 %tmp2 to i32
				%arrayidx5.us = getelementptr inbounds i16*, i16** %B, i32 %i.024.us
				%tmp3 = load i16*, i16** %arrayidx5.us, align 4
				%arrayidx8.us = getelementptr inbounds i32, i32* %C, i32 %i.024.us
				%arrayidx8.promoted.us = load i32, i32* %arrayidx8.us, align 4
				br i1 %tmp1, label %for.cond1.for.cond.cleanup3_crit_edge.us.unr-lcssa, label %for.body4.us

				for.body4.us: ; preds = %for.body4.us, %for.cond1.preheader.us
				%add22.us = phi i32 [ %add.us.3, %for.body4.us ], [ %arrayidx8.promoted.us, %for.cond1.preheader.us ]
				%j.021.us = phi i32 [ %inc.us.3, %for.body4.us ], [ 0, %for.cond1.preheader.us ]
				%niter = phi i32 [ %niter.nsub.3, %for.body4.us ], [ %unroll_iter, %for.cond1.preheader.us ]
				%arrayidx6.us = getelementptr inbounds i16, i16* %tmp3, i32 %j.021.us
				%tmp4 = load i16, i16* %arrayidx6.us, align 2
				%conv7.us = sext i16 %tmp4 to i32
				%mul.us = mul nsw i32 %conv7.us, %conv.us
				%add.us = add nsw i32 %mul.us, %add22.us
				%inc.us = or i32 %j.021.us, 1
				%arrayidx6.us.1 = getelementptr inbounds i16, i16* %tmp3, i32 %inc.us
				%tmp5 = load i16, i16* %arrayidx6.us.1, align 2
				%conv7.us.1 = sext i16 %tmp5 to i32
				%mul.us.1 = mul nsw i32 %conv7.us.1, %conv.us
				%add.us.1 = add nsw i32 %mul.us.1, %add.us
				%inc.us.1 = or i32 %j.021.us, 2
				%arrayidx6.us.2 = getelementptr inbounds i16, i16* %tmp3, i32 %inc.us.1
				%tmp6 = load i16, i16* %arrayidx6.us.2, align 2
				%conv7.us.2 = sext i16 %tmp6 to i32
				%mul.us.2 = mul nsw i32 %conv7.us.2, %conv.us
				%add.us.2 = add nsw i32 %mul.us.2, %add.us.1
				%inc.us.2 = or i32 %j.021.us, 3
				%arrayidx6.us.3 = getelementptr inbounds i16, i16* %tmp3, i32 %inc.us.2
				%tmp7 = load i16, i16* %arrayidx6.us.3, align 2
				%conv7.us.3 = sext i16 %tmp7 to i32
				%mul.us.3 = mul nsw i32 %conv7.us.3, %conv.us
				%add.us.3 = add nsw i32 %mul.us.3, %add.us.2
				%inc.us.3 = add i32 %j.021.us, 4
				%niter.nsub.3 = add i32 %niter, -4
				%niter.ncmp.3 = icmp eq i32 %niter.nsub.3, 0
				br i1 %niter.ncmp.3, label %for.cond1.for.cond.cleanup3_crit_edge.us.unr-lcssa, label %for.body4.us

				for.cond1.for.cond.cleanup3_crit_edge.us.unr-lcssa: ; preds = %for.body4.us, %for.cond1.preheader.us
				%add.us.lcssa.ph = phi i32 [ undef, %for.cond1.preheader.us ], [ %add.us.3, %for.body4.us ]
				%add22.us.unr = phi i32 [ %arrayidx8.promoted.us, %for.cond1.preheader.us ], [ %add.us.3, %for.body4.us ]
				%j.021.us.unr = phi i32 [ 0, %for.cond1.preheader.us ], [ %inc.us.3, %for.body4.us ]
				br i1 %lcmp.mod, label %for.cond1.for.cond.cleanup3_crit_edge.us, label %for.body4.us.epil

				for.body4.us.epil: ; preds = %for.body4.us.epil, %for.cond1.for.cond.cleanup3_crit_edge.us.unr-lcssa
				%add22.us.epil = phi i32 [ %add.us.epil, %for.body4.us.epil ], [ %add22.us.unr, %for.cond1.for.cond.cleanup3_crit_edge.us.unr-lcssa ]
				%j.021.us.epil = phi i32 [ %inc.us.epil, %for.body4.us.epil ], [ %j.021.us.unr, %for.cond1.for.cond.cleanup3_crit_edge.us.unr-lcssa ]
				%epil.iter = phi i32 [ %epil.iter.sub, %for.body4.us.epil ], [ %xtraiter, %for.cond1.for.cond.cleanup3_crit_edge.us.unr-lcssa ]
				%arrayidx6.us.epil = getelementptr inbounds i16, i16* %tmp3, i32 %j.021.us.epil
				%tmp8 = load i16, i16* %arrayidx6.us.epil, align 2
				%conv7.us.epil = sext i16 %tmp8 to i32
				%mul.us.epil = mul nsw i32 %conv7.us.epil, %conv.us
				%add.us.epil = add nsw i32 %mul.us.epil, %add22.us.epil
				%inc.us.epil = add nuw i32 %j.021.us.epil, 1
				%epil.iter.sub = add i32 %epil.iter, -1
				%epil.iter.cmp = icmp eq i32 %epil.iter.sub, 0
				br i1 %epil.iter.cmp, label %for.cond1.for.cond.cleanup3_crit_edge.us, label %for.body4.us.epil

				for.cond1.for.cond.cleanup3_crit_edge.us: ; preds = %for.body4.us.epil, %for.cond1.for.cond.cleanup3_crit_edge.us.unr-lcssa
				%add.us.lcssa = phi i32 [ %add.us.lcssa.ph, %for.cond1.for.cond.cleanup3_crit_edge.us.unr-lcssa ], [ %add.us.epil, %for.body4.us.epil ]
				store i32 %add.us.lcssa, i32* %arrayidx8.us, align 4
				%inc10.us = add nuw i32 %i.024.us, 1
				%exitcond27 = icmp eq i32 %inc10.us, %N
				br i1 %exitcond27, label %for.cond.cleanup, label %for.cond1.preheader.us

				for.cond.cleanup: ; preds = %for.cond1.for.cond.cleanup3_crit_edge.us, %entry
				ret void
				}

				; CHECK-LABEL: mul32x32_backwards
				; CHECK: @ %for.body

				; TODO: post increments for decreasing addresses
				; CHECK-DEFAULT-NOT: ldr{{.*}}]!
				; CHECK-DEFAULT-NOT: str{{.*}}]!

				; CHECK-COMPLEX-NOT: ldr{{.*}}]!
				; CHECK-COMPLEX-NOT: str{{.*}}]!

				define void @mul32x32_backwards(i32* nocapture %a, i32* nocapture readonly %b, i32* nocapture readonly %c, i32 %N) {
				entry:
				%i.08 = add i32 %N, -1
				%cmp9 = icmp sgt i32 %i.08, -1
				br i1 %cmp9, label %for.body.preheader, label %for.cond.cleanup

				for.body.preheader: ; preds = %entry
				%xtraiter = and i32 %N, 3
				%lcmp.mod = icmp eq i32 %xtraiter, 0
				br i1 %lcmp.mod, label %for.body.prol.loopexit, label %for.body.prol

				for.body.prol: ; preds = %for.body.prol, %for.body.preheader
				%i.010.prol = phi i32 [ %i.0.prol, %for.body.prol ], [ %i.08, %for.body.preheader ]
				%prol.iter = phi i32 [ %prol.iter.sub, %for.body.prol ], [ %xtraiter, %for.body.preheader ]
				%arrayidx.prol = getelementptr inbounds i32, i32* %b, i32 %i.010.prol
				%tmp = load i32, i32* %arrayidx.prol, align 4
				%arrayidx1.prol = getelementptr inbounds i32, i32* %c, i32 %i.010.prol
				%tmp1 = load i32, i32* %arrayidx1.prol, align 4
				%mul.prol = mul nsw i32 %tmp1, %tmp
				%arrayidx2.prol = getelementptr inbounds i32, i32* %a, i32 %i.010.prol
				store i32 %mul.prol, i32* %arrayidx2.prol, align 4
				%i.0.prol = add i32 %i.010.prol, -1
				%prol.iter.sub = add i32 %prol.iter, -1
				%prol.iter.cmp = icmp eq i32 %prol.iter.sub, 0
				br i1 %prol.iter.cmp, label %for.body.prol.loopexit, label %for.body.prol

				for.body.prol.loopexit: ; preds = %for.body.prol, %for.body.preheader
				%i.010.unr = phi i32 [ %i.08, %for.body.preheader ], [ %i.0.prol, %for.body.prol ]
				%tmp2 = icmp ult i32 %i.08, 3
				br i1 %tmp2, label %for.cond.cleanup, label %for.body

				for.cond.cleanup: ; preds = %for.body, %for.body.prol.loopexit, %entry
				ret void

				for.body: ; preds = %for.body, %for.body.prol.loopexit
				%i.010 = phi i32 [ %i.0.3, %for.body ], [ %i.010.unr, %for.body.prol.loopexit ]
				%arrayidx = getelementptr inbounds i32, i32* %b, i32 %i.010
				%tmp3 = load i32, i32* %arrayidx, align 4
				%arrayidx1 = getelementptr inbounds i32, i32* %c, i32 %i.010
				%tmp4 = load i32, i32* %arrayidx1, align 4
				%mul = mul nsw i32 %tmp4, %tmp3
				%arrayidx2 = getelementptr inbounds i32, i32* %a, i32 %i.010
				store i32 %mul, i32* %arrayidx2, align 4
				%i.0 = add i32 %i.010, -1
				%arrayidx.1 = getelementptr inbounds i32, i32* %b, i32 %i.0
				%tmp5 = load i32, i32* %arrayidx.1, align 4
				%arrayidx1.1 = getelementptr inbounds i32, i32* %c, i32 %i.0
				%tmp6 = load i32, i32* %arrayidx1.1, align 4
				%mul.1 = mul nsw i32 %tmp6, %tmp5
				%arrayidx2.1 = getelementptr inbounds i32, i32* %a, i32 %i.0
				store i32 %mul.1, i32* %arrayidx2.1, align 4
				%i.0.1 = add i32 %i.010, -2
				%arrayidx.2 = getelementptr inbounds i32, i32* %b, i32 %i.0.1
				%tmp7 = load i32, i32* %arrayidx.2, align 4
				%arrayidx1.2 = getelementptr inbounds i32, i32* %c, i32 %i.0.1
				%tmp8 = load i32, i32* %arrayidx1.2, align 4
				%mul.2 = mul nsw i32 %tmp8, %tmp7
				%arrayidx2.2 = getelementptr inbounds i32, i32* %a, i32 %i.0.1
				store i32 %mul.2, i32* %arrayidx2.2, align 4
				%i.0.2 = add i32 %i.010, -3
				%arrayidx.3 = getelementptr inbounds i32, i32* %b, i32 %i.0.2
				%tmp9 = load i32, i32* %arrayidx.3, align 4
				%arrayidx1.3 = getelementptr inbounds i32, i32* %c, i32 %i.0.2
				%tmp10 = load i32, i32* %arrayidx1.3, align 4
				%mul.3 = mul nsw i32 %tmp10, %tmp9
				%arrayidx2.3 = getelementptr inbounds i32, i32* %a, i32 %i.0.2
				store i32 %mul.3, i32* %arrayidx2.3, align 4
				%i.0.3 = add i32 %i.010, -4
				%cmp.3 = icmp sgt i32 %i.0.3, -1
				br i1 %cmp.3, label %for.body, label %for.cond.cleanup
				}

				; CHECK-LABEL: mul32x32_forwards
				; CHECK: @ %for.body

				; TODO: pre-inc store
				; CHECK-DEFAULT: ldr{{.*}}, #4]
				; CHECK-DEFAULT: ldr{{.*}}, #4]
				; CHECK-DEFAULT: str{{.*}}, #4]
				; CHECK-DEFAULT: ldr{{.*}}, #8]
				; CHECK-DEFAULT: ldr{{.*}}, #8]
				; CHECK-DEFAULT: str{{.*}}, #8]
				; CHECK-DEFAULT: ldr{{.*}}, #12]
				; CHECK-DEFAULT: ldr{{.*}}, #12]
				; CHECK-DEFAULT: str{{.*}}, #12]
				; CHECK-DEFAULT: ldr{{.*}}, #16]!
				; CHECK-DEFAULT: ldr{{.*}}, #16]!
				; CHECK-DEFAULT: str{{.*}}, #16]
				; CHECK-DEFAULT: add{{.*}}, #16

				; TODO: Higher complexity results in 22 instructions vs 20.
				; CHECK-COMPLEX-NOT: ldr{{.*}}, #16]!
				; CHECK-COMPLEX-NOT: str{{.*}}, #16]!

				; CHECK-T2: @ %for.body.epil
				; CHECK-T2: ldr{{.*}}, #4]!
				; CHECK-T2: ldr{{.*}}, #4]!
				; CHECK-T2: str{{.*}}, #4]!

				define void @mul32x32_forwards(i32* nocapture %a, i32* nocapture readonly %b, i32* nocapture readonly %c, i32 %N) {
				entry:
				%cmp8 = icmp eq i32 %N, 0
				br i1 %cmp8, label %for.cond.cleanup, label %for.body.preheader

				for.body.preheader: ; preds = %entry
				%tmp = add i32 %N, -1
				%xtraiter = and i32 %N, 3
				%tmp1 = icmp ult i32 %tmp, 3
				br i1 %tmp1, label %for.cond.cleanup.loopexit.unr-lcssa, label %for.body.preheader.new

				for.body.preheader.new: ; preds = %for.body.preheader
				%unroll_iter = sub i32 %N, %xtraiter
				br label %for.body

				for.cond.cleanup.loopexit.unr-lcssa: ; preds = %for.body, %for.body.preheader
				%i.09.unr = phi i32 [ 0, %for.body.preheader ], [ %inc.3, %for.body ]
				%lcmp.mod = icmp eq i32 %xtraiter, 0
				br i1 %lcmp.mod, label %for.cond.cleanup, label %for.body.epil

				for.body.epil: ; preds = %for.body.epil, %for.cond.cleanup.loopexit.unr-lcssa
				%i.09.epil = phi i32 [ %inc.epil, %for.body.epil ], [ %i.09.unr, %for.cond.cleanup.loopexit.unr-lcssa ]
				%epil.iter = phi i32 [ %epil.iter.sub, %for.body.epil ], [ %xtraiter, %for.cond.cleanup.loopexit.unr-lcssa ]
				%arrayidx.epil = getelementptr inbounds i32, i32* %b, i32 %i.09.epil
				%tmp2 = load i32, i32* %arrayidx.epil, align 4
				%arrayidx1.epil = getelementptr inbounds i32, i32* %c, i32 %i.09.epil
				%tmp3 = load i32, i32* %arrayidx1.epil, align 4
				%mul.epil = mul nsw i32 %tmp3, %tmp2
				%arrayidx2.epil = getelementptr inbounds i32, i32* %a, i32 %i.09.epil
				store i32 %mul.epil, i32* %arrayidx2.epil, align 4
				%inc.epil = add nuw nsw i32 %i.09.epil, 1
				%epil.iter.sub = add i32 %epil.iter, -1
				%epil.iter.cmp = icmp eq i32 %epil.iter.sub, 0
				br i1 %epil.iter.cmp, label %for.cond.cleanup, label %for.body.epil

				for.cond.cleanup: ; preds = %for.body.epil, %for.cond.cleanup.loopexit.unr-lcssa, %entry
				ret void

				for.body: ; preds = %for.body, %for.body.preheader.new
				%i.09 = phi i32 [ 0, %for.body.preheader.new ], [ %inc.3, %for.body ]
				%niter = phi i32 [ %unroll_iter, %for.body.preheader.new ], [ %niter.nsub.3, %for.body ]
				%arrayidx = getelementptr inbounds i32, i32* %b, i32 %i.09
				%tmp4 = load i32, i32* %arrayidx, align 4
				%arrayidx1 = getelementptr inbounds i32, i32* %c, i32 %i.09
				%tmp5 = load i32, i32* %arrayidx1, align 4
				%mul = mul nsw i32 %tmp5, %tmp4
				%arrayidx2 = getelementptr inbounds i32, i32* %a, i32 %i.09
				store i32 %mul, i32* %arrayidx2, align 4
				%inc = or i32 %i.09, 1
				%arrayidx.1 = getelementptr inbounds i32, i32* %b, i32 %inc
				%tmp6 = load i32, i32* %arrayidx.1, align 4
				%arrayidx1.1 = getelementptr inbounds i32, i32* %c, i32 %inc
				%tmp7 = load i32, i32* %arrayidx1.1, align 4
				%mul.1 = mul nsw i32 %tmp7, %tmp6
				%arrayidx2.1 = getelementptr inbounds i32, i32* %a, i32 %inc
				store i32 %mul.1, i32* %arrayidx2.1, align 4
				%inc.1 = or i32 %i.09, 2
				%arrayidx.2 = getelementptr inbounds i32, i32* %b, i32 %inc.1
				%tmp8 = load i32, i32* %arrayidx.2, align 4
				%arrayidx1.2 = getelementptr inbounds i32, i32* %c, i32 %inc.1
				%tmp9 = load i32, i32* %arrayidx1.2, align 4
				%mul.2 = mul nsw i32 %tmp9, %tmp8
				%arrayidx2.2 = getelementptr inbounds i32, i32* %a, i32 %inc.1
				store i32 %mul.2, i32* %arrayidx2.2, align 4
				%inc.2 = or i32 %i.09, 3
				%arrayidx.3 = getelementptr inbounds i32, i32* %b, i32 %inc.2
				%tmp10 = load i32, i32* %arrayidx.3, align 4
				%arrayidx1.3 = getelementptr inbounds i32, i32* %c, i32 %inc.2
				%tmp11 = load i32, i32* %arrayidx1.3, align 4
				%mul.3 = mul nsw i32 %tmp11, %tmp10
				%arrayidx2.3 = getelementptr inbounds i32, i32* %a, i32 %inc.2
				store i32 %mul.3, i32* %arrayidx2.3, align 4
				%inc.3 = add nuw nsw i32 %i.09, 4
				%niter.nsub.3 = add i32 %niter, -4
				%niter.ncmp.3 = icmp eq i32 %niter.nsub.3, 0
				br i1 %niter.ncmp.3, label %for.cond.cleanup.loopexit.unr-lcssa, label %for.body
				}

test/Transforms/LoopStrengthReduce/ARM/complexity.ll

	target datalayout = "e-m:e-p:32:32-i64:64-v128:64:128-a:0:32-n32-S64"			target datalayout = "e-m:e-p:32:32-i64:64-v128:64:128-a:0:32-n32-S64"

	; RUN: opt -mtriple=thumbv7em %s -S -loop-reduce -lsr-complexity-limit=65536 -o - | FileCheck %s --check-prefix=CHECK-DEFAULT			; RUN: opt -mtriple=thumbv7em -mcpu=cortex-m4 %s -S -loop-reduce -lsr-complexity-limit=65536 -o - | FileCheck %s
	; RUN: opt -mtriple=thumbv7em %s -S -loop-reduce -lsr-complexity-limit=2147483647 -o - | FileCheck %s --check-prefix=CHECK-COMPLEX			; RUN: opt -mtriple=thumbv7em -mcpu=cortex-m4 %s -S -loop-reduce -lsr-complexity-limit=2147483647 -o - | FileCheck %s

	; CHECK-DEFAULT-LABEL: for.body12.us.us:			; CHECK-LABEL: for.cond9.preheader.us.us:
	; CHECK-DEFAULT: phi i32			; CHECK: [[SCEVGEP:%[^ ]+]] = getelementptr i16, i16* %tmp5, i32 -4
	; CHECK-DEFAULT: [[LSR_IV:%[^ ]+]] = phi i32 [ [[LSR_IV_NEXT:%[^ ]+]], %for.body12.us.us ], [ 0, %for.cond9.preheader.us.us ]			; CHECK: [[SCEVGEP9:%[^ ]+]] = getelementptr i16, i16* %tmp6, i32 %lsr.iv
	; CHECK-DEFAULT: phi i32
	; CHECK-DEFAULT: [[LSR_IV_NEXT]] = add i32 [[LSR_IV]], 8			; CHECK-LABEL: for.body12.us.us:
				; CHECK: [[LSR_IV10:%[^ ]+]] = phi i16* [ [[SCEVGEP11:%[^ ]+]], %for.body12.us.us ], [ [[SCEVGEP9]], %for.cond9.preheader.us.us ]
	; CHECK-COMPLEX-LABEL: for.body12.us.us:			; CHECK: [[LSR_IV:%[^ ]+]] = phi i16* [ [[SCEVGEP1:%[^ ]+]], %for.body12.us.us ], [ [[SCEVGEP]], %for.cond9.preheader.us.us ]
	; CHECK-COMPLEX: phi i32			; CHECK: getelementptr i16, i16* [[LSR_IV]], i32 4
	; CHECK-COMPLEX: [[LSR_IV6:%[^ ]+]] = phi i16* [ [[SCEVGEP7:%[^ ]+]], %for.body12.us.us ], [ [[SCEVGEP5:%[^ ]+]], %for.cond9.preheader.us.us ]			; CHECK: getelementptr i16, i16* [[LSR_IV10]], i32 4
	; CHECK-COMPLEX: [[LSR_IV:%[^ ]+]] = phi i16* [ [[SCEVGEP1:%[^ ]+]], %for.body12.us.us ], [ [[SCEVGEP:%[^ ]+]], %for.cond9.preheader.us.us ]			; CHECK: getelementptr i16, i16* [[LSR_IV]], i32 5
	; CHECK-COMPLEX: phi i32			; CHECK: getelementptr i16, i16* [[LSR_IV10]], i32 5
	; CHECK-COMPLEX: [[SCEVGEP1]] = getelementptr i16, i16* [[LSR_IV]], i32 4			; CHECK: getelementptr i16, i16* [[LSR_IV]], i32 6
	; CHECK-COMPLEX: [[SCEVGEP7]] = getelementptr i16, i16* [[LSR_IV6]], i32 4			; CHECK: getelementptr i16, i16* [[LSR_IV10]], i32 6
				; CHECK: getelementptr i16, i16* [[LSR_IV]], i32 7
				; CHECK: getelementptr i16, i16* [[LSR_IV10]], i32 7
				; CHECK: [[SCEVGEP1]] = getelementptr i16, i16* [[LSR_IV]], i32 4
				; CHECK: [[SCEVGEP11]] = getelementptr i16, i16* [[LSR_IV10]], i32 4
				gilr: Shouldn't -DEFAULT be removed here too?
				samparker: thanks!

	define void @convolve(i16 nocapture readonly %input_image, i16 nocapture readonly %filter, i32 %filter_dim, i32 %out_width, i32 %out_height, i32** nocapture readonly %convolved) {			define void @convolve(i16 nocapture readonly %input_image, i16 nocapture readonly %filter, i32 %filter_dim, i32 %out_width, i32 %out_height, i32** nocapture readonly %convolved) {
	entry:			entry:
	%cmp92 = icmp eq i32 %out_height, 0			%cmp92 = icmp eq i32 %out_height, 0
	br i1 %cmp92, label %for.cond.cleanup, label %for.cond1.preheader.lr.ph			br i1 %cmp92, label %for.cond.cleanup, label %for.cond1.preheader.lr.ph

	for.cond1.preheader.lr.ph: ; preds = %entry			for.cond1.preheader.lr.ph: ; preds = %entry
	%xtraiter = and i32 %filter_dim, 3			%xtraiter = and i32 %filter_dim, 3

Revision Contents

Diff 181270

include/llvm/Analysis/TargetTransformInfo.h

include/llvm/Analysis/TargetTransformInfoImpl.h

lib/Analysis/TargetTransformInfo.cpp

lib/Target/ARM/ARMTargetTransformInfo.h

lib/Transforms/Scalar/LoopStrengthReduce.cpp

test/CodeGen/ARM/dsp-loop-indexing.ll

test/CodeGen/ARM/loop-align-cortex-m.ll

test/CodeGen/ARM/loop-indexing.ll

test/Transforms/LoopStrengthReduce/ARM/complexity.ll
