The patch attempts to optimize a sequence of SIMD loads from the same
base pointer:
  %0 = getelementptr float, float* %base, i32 4
  %1 = bitcast float* %0 to <4 x float>*
  %2 = load <4 x float>, <4 x float>* %1
  ...
  %n1 = getelementptr float, float* %base, i32 N
  %n2 = bitcast float* %n1 to <4 x float>*
  %n3 = load <4 x float>, <4 x float>* %n2
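For illustration (this example is not from the patch; the function name and offsets are made up), NEON intrinsic code along the following lines lowers to exactly this pattern of <4 x float> loads at constant offsets from one base pointer:

  /* Hypothetical example: three 128-bit loads at fixed offsets from the
     same base pointer, followed by a vector add. */
  #include <arm_neon.h>

  void sum_tail(const float *base, float32x4_t *out) {
    float32x4_t a = vld1q_f32(base + 4);   /* byte offset 16 */
    float32x4_t b = vld1q_f32(base + 8);   /* byte offset 32 */
    float32x4_t c = vld1q_f32(base + 12);  /* byte offset 48 */
    *out = vaddq_f32(vaddq_f32(a, b), c);
  }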
For AArch64 the compiler generates a sequence of LDR Qt, [Xn, #imm] instructions, since the offset can be encoded directly in the load.
However, the 32-bit ARM NEON VLD1/VST1 instructions lack a [Rn, #imm]
addressing mode, so the address is computed separately before every
load/store instruction:
  add r2, r0, #32
  add r0, r0, #16
  vld1.32 {d18, d19}, [r2]
  vld1.32 {d22, d23}, [r0]
This can be improved by computing the address for the first load, and then
using a post-indexed form of VLD1/VST1 to load the rest:
  add r0, r0, #16
  vld1.32 {d18, d19}, [r0]!
  vld1.32 {d22, d23}, [r0]
In order to do that, the patch adds more patterns to DAGCombine:
- (load (add ptr inc1)) and (add ptr inc2) are now folded if inc1 and inc2 are constants.
- (or ptr inc) is now recognized as a pointer increment if ptr is sufficiently aligned.
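To illustrate the second pattern with a small C sketch (the alignment and increment values are only an example, not taken from the patch): when enough low bits of the pointer are known to be zero, OR-ing in the increment computes the same address as adding it, so the OR can safely be treated as a base update.

  #include <assert.h>
  #include <stdint.h>

  /* Hypothetical example: with a 32-byte-aligned pointer the low five
     bits are zero, so OR-ing in an increment smaller than 32 produces
     the same address as an ADD would. */
  static void or_equals_add(uintptr_t ptr) {
    assert((ptr & 31u) == 0);            /* 32-byte aligned */
    assert((ptr | 16u) == (ptr + 16u));  /* OR behaves like ADD here */
  }

Such ORs can show up in the DAG because earlier combines may rewrite an add as an or once the operands are known to have no bits in common, which is presumably why this form needs to be recognized here.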
In addition to that, we now search for all possible base updates and
then pick the best one.
How useful do you think these debug messages will be in the long run?
The number of combines tried can be quite high, and these may just end up adding noise for people not looking at CombineBaseUpdate combines. It seems to be fairly uncommon to add debug messages for DAG combines.
But if you think they are useful, feel free to keep them around.