This is an archive of the discontinued LLVM Phabricator instance.

Differential D6054

[AArch64] Inline memcpy() as a sequence of ldp-stp with 64-bit registers
AbandonedPublic

Authored by sdmitrouk on Oct 31 2014, 7:25 AM.

Details

Reviewers

t.p.northover
jmolloy
Jiangning

Summary

This is the result of my attempts to inline memcpy() in an "optimal" way, meaning interleaved load/store pair instructions that use 64-bit registers.

It was suggested that this be done in the AArch64LoadStoreOptimizer pass, which worked until the PostRA Machine Instruction Scheduler was enabled for the AArch64 target; hence it became a separate pass that runs after PostRA MISched. The pass is disabled by default, and the test changes make the tests pass both with and without it.

When an ldr/str appears in the middle, those instructions are reordered as well, except in cases like:

ldr
ldp
stp
str

which occur only when copying a small amount of data, and I'm not sure whether it's worth reordering them to

ldr
str
ldp
stp

but that can be done.

Unfortunately, I don't have AArch64 hardware to run performance tests on yet, so I can't back this up with numbers, but such a sequence was claimed to be preferred. At least this gives a way to test it. Or it can just stay here for now.
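
As an illustration of the access pattern the summary describes, here is a minimal portable C++ sketch. It is not code from the patch: `copyInterleavedPairs` is a hypothetical helper, and whether a compiler actually lowers each step to a single `ldp`/`stp` is an assumption that depends on the target and optimization level. It only models the order of memory operations: load a pair of 64-bit values, store that pair, then move on to the next 16 bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not part of the patch): copies `pairs` chunks of
// 16 bytes each. Every chunk is read as two 64-bit values and written back
// immediately, mirroring the "ldp, stp, ldp, stp, ..." order.
inline void copyInterleavedPairs(unsigned char *dst, const unsigned char *src,
                                 std::size_t pairs) {
  for (std::size_t i = 0; i < pairs; ++i) {
    std::uint64_t lo, hi;
    // Models "ldp lo, hi, [src, #16*i]".
    std::memcpy(&lo, src + 16 * i, sizeof lo);
    std::memcpy(&hi, src + 16 * i + 8, sizeof hi);
    // Models "stp lo, hi, [dst, #16*i]".
    std::memcpy(dst + 16 * i, &lo, sizeof lo);
    std::memcpy(dst + 16 * i + 8, &hi, sizeof hi);
  }
}
```

Calling `copyInterleavedPairs(dst, src, 4)` copies 64 bytes as four load-pair/store-pair steps. Sizes that are not a multiple of 16 would need a tail of single loads and stores, which is where the ldr/str cases mentioned above come from.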

Event Timeline

sdmitrouk updated this revision to Diff 15614. Oct 31 2014, 7:25 AM

sdmitrouk retitled this revision from to [AArch64] Inline memcpy() as a sequence of ldp-stp with 64-bit registers.

sdmitrouk updated this object.

sdmitrouk edited the test plan for this revision.

sdmitrouk added reviewers: jmolloy, t.p.northover, Jiangning.

sdmitrouk set the repository for this revision to rL LLVM.

sdmitrouk added a subscriber: Unknown Object (MLST).

Herald added a subscriber: aemerson. Oct 31 2014, 7:25 AM

Hi,

I think the principle here is OK. It'd have been nicer if we could convince the scheduler to do this instead, rather than going behind its back though. Have you talked to Andy Trick or Dave Estes to work out if this is possible?

Comments inline.

I'd also like Tim's signoff before this goes in.

James

lib/Target/AArch64/AArch64ISelLowering.cpp
474 ↗	(On Diff #15614)	s/af/as
lib/Target/AArch64/AArch64LoadStoreInterleave.cpp
26	The important thing is that we have ldp/stp in that order, ideally with increasing addresses. We don't need to cluster them all together - it's the ordering of memory operations that counts I think. So we can have:

ldp
stp
add # unrelated operation
ldp
stp

This should be fine, and may be a good thing, depending on the microarchitecture.
49	This is a fairly generic statistic name. Something more A64 specific perhaps?
145	Wouldn't isSafeToSpeculate() conservatively do the same job here?
206	This was not mentioned in the comment; why should all loads come before all stores?
291	Here and elsewhere: single-line if's should have their {}'s removed.
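
The reordering under discussion can be sketched with a toy model in standalone C++. This is an illustration only, not the pass itself: `ToyInstr` and `interleaveToy` are hypothetical names, pairs are matched by an explicit id rather than by registers and offsets, and none of the safety checks the real pass needs (aliasing stores, redefinition of the base register, calls) are modelled. Each load is moved to sit directly before its matching store, and unrelated instructions are allowed to remain between pairs.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy stand-in for a machine instruction (hypothetical, not an LLVM type).
struct ToyInstr {
  enum Kind { Load, Store, Other } kind;
  int pairId;       // which load/store pair this belongs to; unused for Other
  std::string text; // printable form, for inspection only
};

// Moves every load so that it ends up immediately before the store with the
// same pairId. Instructions of kind Other keep their relative order.
// Assumption: moving the load is always legal in this toy model.
inline std::vector<ToyInstr> interleaveToy(std::vector<ToyInstr> seq) {
  for (std::size_t s = 0; s < seq.size(); ++s) {
    if (seq[s].kind != ToyInstr::Store)
      continue;
    for (std::size_t l = 0; l < s; ++l) {
      if (seq[l].kind == ToyInstr::Load && seq[l].pairId == seq[s].pairId) {
        // Rotate the load from position l to position s - 1; everything in
        // between shifts one slot towards the front.
        std::rotate(seq.begin() + l, seq.begin() + l + 1, seq.begin() + s);
        break;
      }
    }
  }
  return seq;
}
```

For the input `ldp1, ldp2, stp1, add, stp2` this produces `ldp1, stp1, add, ldp2, stp2`, which is the shape described in the comment on line 26.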

This revision now requires changes to proceed. Nov 3 2014, 4:04 AM

Addressed some of the comments (will comment on the rest):

Fixed typo s/af/as.
NumSequences => NumLdStSequencesUpdated.
Fixed a rather confusing comment.
Removed extra curly braces around the body of single-line if statements.

Hi James,

> It'd have been nicer if we could convince the scheduler to do this instead, rather than going behind its back though.

That's what I tried initially, but I wasn't able to do it.

> Have you talked to Andy Trick or Dave Estes to work out if this is possible?

I don't think so. There were a couple of related threads on llvm-dev, and doing
this in a way similar to the load/store optimizer was the only proposed solution.
The scheduler doesn't seem to have extension points where one could provide hints
about instructions; at least I didn't find a way other than to subclass it.

Comments inline.

Comments not answered inline here are addressed in the newer revision.

Sergey

lib/Target/AArch64/AArch64LoadStoreInterleave.cpp
26	I'll try that, it shouldn't require a lot of changes.

> ideally with increasing addresses

Actually, the input of the pass is already in reverse order:

ldp x10, x11, [x8, #48]
stp x10, x11, [x9, #48]
ldp x10, x11, [x8, #32]
stp x10, x11, [x9, #32]
ldp x10, x11, [x8, #16]
stp x10, x11, [x9, #16]
ldp x10, x8, [x8]
stp x10, x8, [x9]

which might come from `getMemcpyLoadsAndStores()` in `SelectionDAG.cpp`, which doesn't specify order.
145	I don't see any function with exactly this name; functions with similar names don't seem to fit, and some are also static.
206	Wrong comment, I meant that the first load should go before the first store.

I'm really not sure about this one. I agree with James that hacking around with the instructions after the scheduler seems really iffy. It sounds much more like we're hitting a scheduler defect that we want to fix properly instead, unless it's a constraint that's just impossible to represent.

Perhaps some kind of forwarding from a load to a dependent store has been omitted?

I've also got some other issues with the actual implementation.

Cheers.

Tim.

lib/Target/AArch64/AArch64ISelLowering.cpp
6615–6620 ↗	(On Diff #15691)	How general is this? We should be writing for future cores as well as existing ones, and always preferring 64-bit operations seems like it'll be more and more of an oddity in future. It also seems like it belongs in a completely separate patch to the interleaving one.
lib/Target/AArch64/AArch64LoadStoreInterleave.cpp
243–246	This seems like a really fragile way to do this. It's only ever going to work on a basic block with a single memcpy operation and no other loads/stores.

> It sounds much more like we're hitting a scheduler defect that we want to fix properly instead, unless it's a constraint that's just impossible to represent.

The scheduler seems to do its job correctly for the generic case, but it seems to be
missing information about instruction operands. In this case it could ignore the latency
of ldp when it's followed by an stp with the same operands.

> Perhaps some kind of forwarding from a load to a dependent store has been omitted?

I tried gluing and/or combining nodes in a lot of ways; the scheduler doesn't care about
any of these. Another way would be to introduce a pseudo-instruction and expand it after
scheduling, but it requires temporary registers and it's too late to allocate registers
at that point.

Regards,
Sergey

lib/Target/AArch64/AArch64ISelLowering.cpp
6615–6620 ↗	(On Diff #15691)

> How general is this? We should be writing for future cores as well as existing ones, and always preferring 64-bit operations seems like it'll be more and more of an oddity in future.

If I get it right, the problem with 128-bit registers is that they are floating point registers rather than general purpose ones, so as long as there are no 128-bit GP registers, this should hold.

> It also seems like it belongs in a completely separate patch to the interleaving one.

The pass is there to allow better code generation for `memcpy()`, but they can be separated technically (the pass can go first, maybe with slightly changed tests).
lib/Target/AArch64/AArch64LoadStoreInterleave.cpp
243–246

> It's only ever going to work on a basic block with a single memcpy operation and no other loads/stores.

The condition isn't exactly this one, but it does have a similar constraint. The next revision, which works on pairs of instructions, should change this.

Changed code to process pairs of loads and stores rather than all of them inside a basic block.
Removed changes that are not directly related to new pass.

Tim, James,

I've updated the diff, but I'll also ask Andy Trick and/or Dave Estes if there
is a better way that involves the instruction scheduler, using this review as an
example of what I want to achieve.

Cheers,
Sergey

> In this case it could ignore latency of ldp when it's followed by stp with same operands.

I don't think that's right. We're not magically going to make the stp less quick, we can just issue them back to back in the same cycle. Potentially a ScheduleHazardRecognizer might be the right thing here?

>> In this case it could ignore latency of ldp when it's followed by stp with same operands.

> I don't think that's right. We're not magically going to make the stp less quick, we can just issue them back to back in the same cycle.

Well, I didn't assume there is some magic; what I meant is that when the scheduler looks for the next instruction after stp, ldp should be the best match among all predecessors.

> Potentially a ScheduleHazardRecognizer might be the right thing here?

From its description, I'd say that it does the opposite: it allows postponing execution of some instruction until the next cycle.

Dave's advice to look at clustering in the scheduler, applied through DAG mutations, almost worked. The only issue is that some "free" instructions can still be inserted between ldp and stp (previously these were the instructions that compute addresses), and I'm not sure this can be solved using clustering. The next thing to try is a custom scheduling strategy; it might be an option, but I have only just started trying to add it.

A custom instruction scheduler (actually, both pre-RA and post-RA might be needed) might be a solution, but it replaces the generic scheduler, which will only make scheduling worse.

No more ideas; it looks like such instruction interleaving can't be achieved in LLVM with reasonable effort.

Revision Contents

Path	Size
lib/Target/AArch64/AArch64.h	1 line
lib/Target/AArch64/AArch64LoadStoreInterleave.cpp	330 lines
lib/Target/AArch64/AArch64TargetMachine.cpp	10 lines
lib/Target/AArch64/CMakeLists.txt	1 line
test/CodeGen/AArch64/arm64-variadic-aapcs.ll	2 lines
test/CodeGen/AArch64/arm64-virtual_base.ll	2 lines
test/CodeGen/AArch64/func-calls.ll	10 lines
test/CodeGen/AArch64/memcpy-f128.ll	2 lines
test/CodeGen/AArch64/optimal-load-store-pairs.ll	38 lines

Diff 15881

lib/Target/AArch64/AArch64.h

	Show All 29 Lines
	FunctionPass *createAArch64ConditionalCompares();			FunctionPass *createAArch64ConditionalCompares();
	FunctionPass *createAArch64AdvSIMDScalar();			FunctionPass *createAArch64AdvSIMDScalar();
	FunctionPass *createAArch64BranchRelaxation();			FunctionPass *createAArch64BranchRelaxation();
	FunctionPass *createAArch64ISelDag(AArch64TargetMachine &TM,			FunctionPass *createAArch64ISelDag(AArch64TargetMachine &TM,
	CodeGenOpt::Level OptLevel);			CodeGenOpt::Level OptLevel);
	FunctionPass *createAArch64StorePairSuppressPass();			FunctionPass *createAArch64StorePairSuppressPass();
	FunctionPass *createAArch64ExpandPseudoPass();			FunctionPass *createAArch64ExpandPseudoPass();
	FunctionPass *createAArch64LoadStoreOptimizationPass();			FunctionPass *createAArch64LoadStoreOptimizationPass();
				FunctionPass *createAArch64LoadStoreInterleavePass();
	ModulePass *createAArch64PromoteConstantPass();			ModulePass *createAArch64PromoteConstantPass();
	FunctionPass *createAArch64ConditionOptimizerPass();			FunctionPass *createAArch64ConditionOptimizerPass();
	FunctionPass *createAArch64AddressTypePromotionPass();			FunctionPass *createAArch64AddressTypePromotionPass();
	FunctionPass *createAArch64A57FPLoadBalancing();			FunctionPass *createAArch64A57FPLoadBalancing();
	FunctionPass *createAArch64A53Fix835769();			FunctionPass *createAArch64A53Fix835769();
	/// \brief Creates an ARM-specific Target Transformation Info pass.			/// \brief Creates an ARM-specific Target Transformation Info pass.
	ImmutablePass *			ImmutablePass *
	createAArch64TargetTransformInfoPass(const AArch64TargetMachine *TM);			createAArch64TargetTransformInfoPass(const AArch64TargetMachine *TM);

	FunctionPass *createAArch64CleanupLocalDynamicTLSPass();			FunctionPass *createAArch64CleanupLocalDynamicTLSPass();

	FunctionPass *createAArch64CollectLOHPass();			FunctionPass *createAArch64CollectLOHPass();
	} // end namespace llvm			} // end namespace llvm

	#endif			#endif

lib/Target/AArch64/AArch64LoadStoreInterleave.cpp

This file was added.

				//=- AArch64LoadStoreInterleave.cpp - Optimize Load/Store pairs for AArch64 -=//
				//
				// The LLVM Compiler Infrastructure
				//
				// This file is distributed under the University of Illinois Open Source
				// License. See LICENSE.TXT for details.
				//
				//===----------------------------------------------------------------------===//
				//
				// This pass reorders load/store pair instructions to achieve better
				// performance. Preferred sequence of operations is as follows:
				//
				// * [1]: load pair of 64-bit registers
				// * [1]: store pair of 64-bit registers
				// * [2]: load pair of 64-bit registers
				// * [2]: store pair of 64-bit registers
				// * ...
				//
				// Example of transformation:
				//
				//    Before:            After:
				//
				// 1. <load1>         1. <something1>
				// 2. <something1>    2. <load1>
				// 3. <store1>        3. <store1>
				// 4. <something2>    4. <something2>
				// 5. <load2>         5. <load2>
				// 6. <store2>        6. <store2>
				// 7. <something3>    7. <something3>
				//
				//===----------------------------------------------------------------------===//

				#include "AArch64.h"
				#include "AArch64InstrInfo.h"
				#include "llvm/ADT/SmallVector.h"
				#include "llvm/ADT/Statistic.h"
				#include "llvm/CodeGen/MachineFunction.h"
				#include "llvm/CodeGen/MachineFunctionPass.h"
				#include "llvm/CodeGen/MachineInstrBuilder.h"
				#include "llvm/Support/Debug.h"
				#include "llvm/Target/TargetInstrInfo.h"
				#include "llvm/Target/TargetSubtargetInfo.h"
				#include <cstdlib>

				using namespace llvm;

				#define DEBUG_TYPE "aarch64-ldst-itl"

				STATISTIC(NumLdStSequencesUpdated, "Number of load/pair sequences updated");

				namespace {
				class AArch64LoadStoreInterleave : public MachineFunctionPass {
				const TargetInstrInfo *TII;
				const TargetRegisterInfo *TRI;

				public:
				static char ID;
				AArch64LoadStoreInterleave() : MachineFunctionPass(ID) {}

				bool runOnMachineFunction(MachineFunction &MF) override;
				bool interleaveMemOp(MachineBasicBlock &MBB);
				void moveInstruction(MachineInstr *I,
				MachineBasicBlock::iterator InsertionPoint);
				const char *getPassName() const override {
				return "AArch64 LoadStore Interleave";
				}
				};
				} // end anonymous namespace

				char AArch64LoadStoreInterleave::ID = 0;

				FunctionPass *llvm::createAArch64LoadStoreInterleavePass() {
				return new AArch64LoadStoreInterleave();
				}

				// Optimizes every basic block of the function.
				bool AArch64LoadStoreInterleave::runOnMachineFunction(MachineFunction &MF) {
				DEBUG(dbgs() << "******** AArch64 LoadStore Interleaving ********\n"
				<< "********** Function: " << MF.getName() << '\n');

				const TargetMachine &TM = MF.getTarget();
				TII = static_cast<const AArch64InstrInfo *>(
				TM.getSubtargetImpl()->getInstrInfo());
				TRI = TM.getSubtargetImpl()->getRegisterInfo();

				bool Modified = false;
				for (auto &MBB : MF) {
				Modified |= interleaveMemOp(MBB);
				}

				return Modified;
				}

				// Gets size of operands of load or store pair instruction in bytes.
				static int getOperandWidth(const MachineInstr* I) {
				switch (I->getOpcode()) {
				default:
				llvm_unreachable("Didn't expect anything except load and store pairs.");

				case AArch64::STPWi:
				case AArch64::LDPWi:
				case AArch64::STRWui:
				case AArch64::STURWi:
				case AArch64::LDRWui:
				case AArch64::LDURWi:
				return 4;

				case AArch64::STPXi:
				case AArch64::LDPXi:
				case AArch64::STRXui:
				case AArch64::STURXi:
				case AArch64::LDRXui:
				case AArch64::LDURXi:
				return 8;

				case AArch64::STPSi:
				case AArch64::LDPSi:
				case AArch64::STRSui:
				case AArch64::STURSi:
				case AArch64::LDRSui:
				case AArch64::LDURSi:
				return 4;

				case AArch64::STPDi:
				case AArch64::LDPDi:
				case AArch64::STRDui:
				case AArch64::STURDi:
				case AArch64::LDRDui:
				case AArch64::LDURDi:
				return 8;

				case AArch64::STPQi:
				case AArch64::LDPQi:
				case AArch64::STRQui:
				case AArch64::STURQi:
				case AArch64::LDRQui:
				case AArch64::LDURQi:
				return 16;
				}
				}

				// Extract base address from the instruction.
				static inline unsigned getBase(const MachineInstr* I) {
				unsigned OpNum = (I->getNumOperands() == 4) ? 2 : 1;
				return I->getOperand(OpNum).getReg();
				}

				// Extract offset from the instruction.
				static inline int64_t getOffset(const MachineInstr* I) {
				unsigned OpNum = (I->getNumOperands() == 4) ? 3 : 2;
				return I->getOperand(OpNum).getImm();
				}

				// Checks whether store instruction can be moved before the load instruction.
				static bool isSafeToMoveStore(MachineInstr *I, MachineInstr *St) {
				if (getBase(I) != getBase(St))
				return false;

				int IOperandWidth;

				switch (I->getOpcode()) {
				default:
				return false;

				case AArch64::STPSi:
				case AArch64::STPWi:
				case AArch64::STPDi:
				case AArch64::STPXi:
				case AArch64::STPQi:
				IOperandWidth = getOperandWidth(I);
				break;

				case AArch64::STRSui:
				case AArch64::STURSi:
				case AArch64::STRWui:
				case AArch64::STURWi:
				IOperandWidth = 4;
				break;

				case AArch64::STRDui:
				case AArch64::STURDi:
				case AArch64::STRXui:
				case AArch64::STURXi:
				IOperandWidth = 8;
				break;

				case AArch64::STRQui:
				case AArch64::STURQi:
				IOperandWidth = 16;
				break;
				}

				int StOperandWidth = getOperandWidth(St);

				int64_t StBegin = getOffset(St)*StOperandWidth;
				int64_t StEnd = StBegin + StOperandWidth;

				int64_t IBegin = getOffset(I)*IOperandWidth;
				int64_t IEnd = IBegin + IOperandWidth;

				return (IBegin <= StBegin || IBegin >= StEnd) &&
				(IEnd <= StBegin || IEnd >= StEnd) &&
				(IBegin != StBegin || IEnd != StEnd);
				}

				// Checks that instruction can be safely moved outside pair of load and store
				// pair instructions.
				static bool isSafeInstruction(MachineInstr *I, MachineInstr *Ld,
				MachineInstr *St, const TargetRegisterInfo *TRI) {
				if (I->isDebugValue())
				return true;

				if (I->isCall() || I->isTerminator() || I->hasUnmodeledSideEffects())
				return false;

				if (I->mayStore() && !isSafeToMoveStore(I, St))
				return false;

				unsigned Base = getBase(Ld);
				for (const MachineOperand &MO : I->operands()) {
				if (!MO.isReg())
				continue;

				unsigned Reg = MO.getReg();
				if (MO.isDef() && TRI->regsOverlap(Reg, Base))
				return false;
				}

				return true;
				}

				// Collects links to load and store instructions from the basic block.
				static void collectLoadAndStores(MachineBasicBlock &MBB,
				SmallVectorImpl<MachineInstr*> &Lds,
				SmallVectorImpl<MachineInstr*> &Sts) {
				for (MachineInstr &MI : MBB) {
				switch (MI.getOpcode()) {
				default:
				// Just move on to the next instruction.
				break;

				case AArch64::STPSi:
				case AArch64::STPDi:
				case AArch64::STPQi:
				case AArch64::STPWi:
				case AArch64::STPXi:
				// Sequence of interesting operations should go first.
				if (!Lds.empty())
				Sts.push_back(&MI);
				break;

				case AArch64::LDPSi:
				case AArch64::LDPDi:
				case AArch64::LDPQi:
				case AArch64::LDPWi:
				case AArch64::LDPXi:
				Lds.push_back(&MI);
				break;
				}
				}
				}

				// Checks if a set of load and store instructions can be safely reordered.
				static bool isSafeToReorder(MachineInstr *Ld, MachineInstr *St,
				const TargetRegisterInfo *TRI) {
				MachineBasicBlock::iterator I = Ld;
				MachineBasicBlock::iterator E = St;
				while (++I != E) {
				if (!isSafeInstruction(&*I, Ld, St, TRI))
				return false;
				}

				return true;
				}

				// Checks whether it's allowed and makes sense to move load instruction towards
				// store instruction.
				static bool shouldTryToMove(MachineInstr *Ld, MachineInstr *St) {
				return getOperandWidth(Ld) == getOperandWidth(St) &&
				Ld->getOperand(0).getReg() == St->getOperand(0).getReg() &&
				Ld->getOperand(1).getReg() == St->getOperand(1).getReg() &&
				getOffset(Ld) == getOffset(St) &&
				std::distance(MachineBasicBlock::iterator(Ld),
				MachineBasicBlock::iterator(St)) > 1;
				}

				// Evaluates possibility and performs reordering of load and store instructions
				// within basic block.
				bool AArch64LoadStoreInterleave::interleaveMemOp(MachineBasicBlock &MBB) {
				SmallVector<MachineInstr*, 8> Lds;
				SmallVector<MachineInstr*, 8> Sts;

				collectLoadAndStores(MBB, Lds, Sts);

				bool Changed = false;

				for (int i = Lds.size() - 1; i >= 0; --i) {
				for (int j = 0, n2 = Sts.size(); j < n2; ++j) {
				MachineInstr *Ld = Lds[i];
				MachineInstr *St = Sts[j];

				if (!shouldTryToMove(Ld, St))
				continue;

				if (!isSafeToReorder(Ld, St, TRI))
				continue;

				moveInstruction(Ld, St);

				++NumLdStSequencesUpdated;
				Changed = true;

				break;
				}
				}

				return Changed;
				}

				// Moves load or store pair instruction before the insertion point.
				void AArch64LoadStoreInterleave::moveInstruction(
				MachineInstr *I, MachineBasicBlock::iterator InsertionPoint) {
				MachineInstr *NewI = BuildMI(*I->getParent(), InsertionPoint,
				I->getDebugLoc(), TII->get(I->getOpcode()));
				for (const MachineOperand &operand : I->operands()) {
				NewI->addOperand(operand);
				}

				I->eraseFromParent();
				}
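
The store-safety check in `isSafeToMoveStore()` above boils down to asking whether two byte ranges relative to the same base register overlap. For comparison, a conventional half-open interval test is sketched below in standalone C++. It is not the patch's condition: `ByteRange` and `rangesOverlap` are hypothetical names introduced here. As far as I can tell, the two formulations disagree when one range strictly contains the other, for example [0, 16) against [4, 8), where the three-clause condition in the listing reports the move as safe while the test below reports an overlap.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper type: a half-open byte range [begin, end) measured
// from a common base register.
struct ByteRange {
  std::int64_t begin;
  std::int64_t end;
};

// Two half-open ranges overlap exactly when each one starts before the
// other one ends. Ranges that merely touch, such as [0, 8) and [8, 16),
// do not overlap.
inline bool rangesOverlap(ByteRange a, ByteRange b) {
  return a.begin < b.end && b.begin < a.end;
}
```

With a helper of this shape, a store would be considered safe to move past another store only when `rangesOverlap` returns false for their byte ranges.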

lib/Target/AArch64/AArch64TargetMachine.cpp

Show First 20 Lines • Show All 74 Lines • ▼ Show 20 Lines	EnableCondOpt("aarch64-condopt",
cl::desc("Enable the condition optimizer pass"),		cl::desc("Enable the condition optimizer pass"),
cl::init(true), cl::Hidden);		cl::init(true), cl::Hidden);

static cl::opt<bool>		static cl::opt<bool>
EnableA53Fix835769("aarch64-fix-cortex-a53-835769", cl::Hidden,		EnableA53Fix835769("aarch64-fix-cortex-a53-835769", cl::Hidden,
cl::desc("Work around Cortex-A53 erratum 835769"),		cl::desc("Work around Cortex-A53 erratum 835769"),
cl::init(false));		cl::init(false));

		static cl::opt<bool>
		EnableAArch64InterleavedMemOp("aarch64-interleaved-ldstp", cl::Hidden,
		cl::desc("Allow AArch64 load/store clustering and "
		"interleaving"),
		cl::init(false));

extern "C" void LLVMInitializeAArch64Target() {		extern "C" void LLVMInitializeAArch64Target() {
// Register the target.		// Register the target.
RegisterTargetMachine<AArch64leTargetMachine> X(TheAArch64leTarget);		RegisterTargetMachine<AArch64leTargetMachine> X(TheAArch64leTarget);
RegisterTargetMachine<AArch64beTargetMachine> Y(TheAArch64beTarget);		RegisterTargetMachine<AArch64beTargetMachine> Y(TheAArch64beTarget);
RegisterTargetMachine<AArch64leTargetMachine> Z(TheARM64Target);		RegisterTargetMachine<AArch64leTargetMachine> Z(TheARM64Target);
}		}

/// TargetMachine ctor - Create an AArch64 architecture model.		/// TargetMachine ctor - Create an AArch64 architecture model.
▲ Show 20 Lines • Show All 173 Lines • ▼ Show 20 Lines	bool AArch64PassConfig::addPreSched2() {
addPass(createAArch64ExpandPseudoPass());		addPass(createAArch64ExpandPseudoPass());
// Use load/store pair instructions when possible.		// Use load/store pair instructions when possible.
if (TM->getOptLevel() != CodeGenOpt::None && EnableLoadStoreOpt)		if (TM->getOptLevel() != CodeGenOpt::None && EnableLoadStoreOpt)
addPass(createAArch64LoadStoreOptimizationPass());		addPass(createAArch64LoadStoreOptimizationPass());
return true;		return true;
}		}

bool AArch64PassConfig::addPreEmitPass() {		bool AArch64PassConfig::addPreEmitPass() {
		// Reorder load/store pair instruction for better performance.
		if (TM->getOptLevel() != CodeGenOpt::None && EnableLoadStoreOpt &&
		EnableAArch64InterleavedMemOp)
		addPass(createAArch64LoadStoreInterleavePass());
if (EnableA53Fix835769)		if (EnableA53Fix835769)
addPass(createAArch64A53Fix835769());		addPass(createAArch64A53Fix835769());
// Relax conditional branch instructions if they're otherwise out of		// Relax conditional branch instructions if they're otherwise out of
// range of their destination.		// range of their destination.
addPass(createAArch64BranchRelaxation());		addPass(createAArch64BranchRelaxation());
if (TM->getOptLevel() != CodeGenOpt::None && EnableCollectLOH &&		if (TM->getOptLevel() != CodeGenOpt::None && EnableCollectLOH &&
TM->getSubtarget<AArch64Subtarget>().isTargetMachO())		TM->getSubtarget<AArch64Subtarget>().isTargetMachO())
addPass(createAArch64CollectLOHPass());		addPass(createAArch64CollectLOHPass());
return true;		return true;
}		}

lib/Target/AArch64/CMakeLists.txt

Show All 27 Lines	add_llvm_target(AArch64CodeGen
AArch64FastISel.cpp		AArch64FastISel.cpp
AArch64A53Fix835769.cpp		AArch64A53Fix835769.cpp
AArch64FrameLowering.cpp		AArch64FrameLowering.cpp
AArch64ConditionOptimizer.cpp		AArch64ConditionOptimizer.cpp
AArch64ISelDAGToDAG.cpp		AArch64ISelDAGToDAG.cpp
AArch64ISelLowering.cpp		AArch64ISelLowering.cpp
AArch64InstrInfo.cpp		AArch64InstrInfo.cpp
AArch64LoadStoreOptimizer.cpp		AArch64LoadStoreOptimizer.cpp
		AArch64LoadStoreInterleave.cpp
AArch64MCInstLower.cpp		AArch64MCInstLower.cpp
AArch64PromoteConstant.cpp		AArch64PromoteConstant.cpp
AArch64PBQPRegAlloc.cpp		AArch64PBQPRegAlloc.cpp
AArch64RegisterInfo.cpp		AArch64RegisterInfo.cpp
AArch64SelectionDAGInfo.cpp		AArch64SelectionDAGInfo.cpp
AArch64StorePairSuppress.cpp		AArch64StorePairSuppress.cpp
AArch64Subtarget.cpp		AArch64Subtarget.cpp
AArch64TargetMachine.cpp		AArch64TargetMachine.cpp
Show All 12 Lines

test/CodeGen/AArch64/arm64-variadic-aapcs.ll

	; RUN: llc -verify-machineinstrs -mtriple=arm64-linux-gnu -pre-RA-sched=linearize -enable-misched=false < %s | FileCheck %s			; RUN: llc -verify-machineinstrs -mtriple=arm64-linux-gnu -pre-RA-sched=linearize -enable-misched=false < %s -mcpu=cyclone | FileCheck %s

	%va_list = type {i8, i8, i8*, i32, i32}			%va_list = type {i8, i8, i8*, i32, i32}

	@var = global %va_list zeroinitializer, align 8			@var = global %va_list zeroinitializer, align 8

	declare void @llvm.va_start(i8*)			declare void @llvm.va_start(i8*)

	define void @test_simple(i32 %n, ...) {			define void @test_simple(i32 %n, ...) {
	▲ Show 20 Lines • Show All 134 Lines • Show Last 20 Lines

test/CodeGen/AArch64/arm64-virtual_base.ll

	; RUN: llc < %s -O3 -march arm64 | FileCheck %s			; RUN: llc < %s -O3 -march arm64 -mcpu=cyclone | FileCheck %s
	; <rdar://13463602>			; <rdar://13463602>

	%struct.Counter_Struct = type { i64, i64 }			%struct.Counter_Struct = type { i64, i64 }
	%struct.Bicubic_Patch_Struct = type { %struct.Method_Struct, i32, %struct.Object_Struct, %struct.Texture_Struct, %struct.Interior_Struct, %struct.Object_Struct, %struct.Object_Struct, %struct.Bounding_Box_Struct, i64, i32, i32, i32, [4 x [4 x [3 x double]]], [3 x double], double, double, %struct.Bezier_Node_Struct* }			%struct.Bicubic_Patch_Struct = type { %struct.Method_Struct, i32, %struct.Object_Struct, %struct.Texture_Struct, %struct.Interior_Struct, %struct.Object_Struct, %struct.Object_Struct, %struct.Bounding_Box_Struct, i64, i32, i32, i32, [4 x [4 x [3 x double]]], [3 x double], double, double, %struct.Bezier_Node_Struct* }
	%struct.Method_Struct = type { i32 (%struct.Object_Struct, %struct.Ray_Struct, %struct.istack_struct), i32 (double, %struct.Object_Struct), void (double, %struct.Object_Struct, %struct.istk_entry), i8 (%struct.Object_Struct), void (%struct.Object_Struct, double, %struct.Transform_Struct), void (%struct.Object_Struct, double, %struct.Transform_Struct), void (%struct.Object_Struct, double, %struct.Transform_Struct), void (%struct.Object_Struct, %struct.Transform_Struct), void (%struct.Object_Struct), void (%struct.Object_Struct)* }			%struct.Method_Struct = type { i32 (%struct.Object_Struct, %struct.Ray_Struct, %struct.istack_struct), i32 (double, %struct.Object_Struct), void (double, %struct.Object_Struct, %struct.istk_entry), i8 (%struct.Object_Struct), void (%struct.Object_Struct, double, %struct.Transform_Struct), void (%struct.Object_Struct, double, %struct.Transform_Struct), void (%struct.Object_Struct, double, %struct.Transform_Struct), void (%struct.Object_Struct, %struct.Transform_Struct), void (%struct.Object_Struct), void (%struct.Object_Struct)* }
	%struct.Object_Struct = type { %struct.Method_Struct, i32, %struct.Object_Struct, %struct.Texture_Struct, %struct.Interior_Struct, %struct.Object_Struct, %struct.Object_Struct, %struct.Bounding_Box_Struct, i64 }			%struct.Object_Struct = type { %struct.Method_Struct, i32, %struct.Object_Struct, %struct.Texture_Struct, %struct.Interior_Struct, %struct.Object_Struct, %struct.Object_Struct, %struct.Bounding_Box_Struct, i64 }
	%struct.Texture_Struct = type { i16, i16, i16, i32, float, float, float, %struct.Warps_Struct, %struct.Pattern_Struct, %struct.Blend_Map_Struct, %union.anon.9, %struct.Texture_Struct, %struct.Pigment_Struct, %struct.Tnormal_Struct, %struct.Finish_Struct, %struct.Texture_Struct, i32 }			%struct.Texture_Struct = type { i16, i16, i16, i32, float, float, float, %struct.Warps_Struct, %struct.Pattern_Struct, %struct.Blend_Map_Struct, %union.anon.9, %struct.Texture_Struct, %struct.Pigment_Struct, %struct.Tnormal_Struct, %struct.Finish_Struct, %struct.Texture_Struct, i32 }
	%struct.Warps_Struct = type { i16, %struct.Warps_Struct* }			%struct.Warps_Struct = type { i16, %struct.Warps_Struct* }

test/CodeGen/AArch64/func-calls.ll

	; RUN: llc -verify-machineinstrs < %s -mtriple=aarch64-none-linux-gnu | FileCheck %s --check-prefix=CHECK			; RUN: llc -verify-machineinstrs < %s -mtriple=aarch64-none-linux-gnu -mcpu=cyclone | FileCheck %s --check-prefix=CHECK
	; RUN: llc -verify-machineinstrs < %s -mtriple=aarch64-none-linux-gnu -mattr=-neon | FileCheck --check-prefix=CHECK-NONEON %s			; RUN: llc -verify-machineinstrs < %s -mtriple=aarch64-none-linux-gnu -mattr=-neon -mcpu=cyclone | FileCheck --check-prefix=CHECK-NONEON %s
	; RUN: llc -verify-machineinstrs < %s -mtriple=aarch64-none-linux-gnu -mattr=-fp-armv8 | FileCheck --check-prefix=CHECK-NOFP %s			; RUN: llc -verify-machineinstrs < %s -mtriple=aarch64-none-linux-gnu -mattr=-fp-armv8 -mcpu=cyclone | FileCheck --check-prefix=CHECK-NOFP %s
	; RUN: llc -verify-machineinstrs < %s -mtriple=aarch64_be-none-linux-gnu | FileCheck --check-prefix=CHECK-BE %s			; RUN: llc -verify-machineinstrs < %s -mtriple=aarch64_be-none-linux-gnu -mcpu=cyclone | FileCheck --check-prefix=CHECK-BE %s

	%myStruct = type { i64 , i8, i32 }			%myStruct = type { i64 , i8, i32 }

	@var8 = global i8 0			@var8 = global i8 0
	@var8_2 = global i8 0			@var8_2 = global i8 0
	@var32 = global i32 0			@var32 = global i32 0
	@var64 = global i64 0			@var64 = global i64 0
	@var128 = global i128 0			@var128 = global i128 0
	▲ Show 20 Lines • Show All 124 Lines • ▼ Show 20 Lines
	; CHECK-NONEON: stp [[I128LO]], [[I128HI]], [sp, #16]			; CHECK-NONEON: stp [[I128LO]], [[I128HI]], [sp, #16]
	; CHECK: bl check_i128_stackalign			; CHECK: bl check_i128_stackalign

	call void @check_i128_regalign(i32 0, i128 42)			call void @check_i128_regalign(i32 0, i128 42)
	; CHECK-NOT: mov x1			; CHECK-NOT: mov x1
	; CHECK-LE: movz x2, #{{0x2a|42}}			; CHECK-LE: movz x2, #{{0x2a|42}}
	; CHECK-LE: mov x3, xzr			; CHECK-LE: mov x3, xzr
	; CHECK-BE: movz {{x|w}}3, #{{0x2a|42}}			; CHECK-BE: movz {{x|w}}3, #{{0x2a|42}}
	; CHECK-BE: mov x2, xzr			; CHECK-BE: mov{{z?}} x2, {{xzr|#0}}
	; CHECK: bl check_i128_regalign			; CHECK: bl check_i128_regalign

	ret void			ret void
	}			}

	@fptr = global void()* null			@fptr = global void()* null

	define void @check_indirect_call() {			define void @check_indirect_call() {
	; CHECK-LABEL: check_indirect_call:			; CHECK-LABEL: check_indirect_call:
	%func = load void()** @fptr			%func = load void()** @fptr
	call void %func()			call void %func()
	; CHECK: ldr [[FPTR:x[0-9]+]], [{{x[0-9]+}}, {{#?}}:lo12:fptr]			; CHECK: ldr [[FPTR:x[0-9]+]], [{{x[0-9]+}}, {{#?}}:lo12:fptr]
	; CHECK: blr [[FPTR]]			; CHECK: blr [[FPTR]]

	ret void			ret void
	}			}

test/CodeGen/AArch64/memcpy-f128.ll

	; RUN: llc < %s -march=aarch64 -mtriple=aarch64-linux-gnu | FileCheck %s			; RUN: llc < %s -march=aarch64 -mtriple=aarch64-linux-gnu -mcpu=cyclone | FileCheck %s

	%structA = type { i128 }			%structA = type { i128 }
	@stubA = internal unnamed_addr constant %structA zeroinitializer, align 8			@stubA = internal unnamed_addr constant %structA zeroinitializer, align 8

	; Make sure we don't hit llvm_unreachable.			; Make sure we don't hit llvm_unreachable.

	define void @test1() {			define void @test1() {
	; CHECK-LABEL: @test1			; CHECK-LABEL: @test1

test/CodeGen/AArch64/optimal-load-store-pairs.ll

This file was added.

				; RUN: llc < %s -mcpu=cortex-a53 -march=aarch64 -mtriple=aarch64-linux-gnu -aarch64-interleaved-ldstp=1 | FileCheck %s
				; RUN: llc < %s -mcpu=cortex-a57 -march=aarch64 -mtriple=aarch64-linux-gnu -aarch64-interleaved-ldstp=1 | FileCheck %s

				; Here "optimal" means interleaving loads and stores without any instructions in
				; the middle.

				; marked as external to prevent possible optimizations
				@a = external global [4 x i32]
				@b = external global [4 x i32]

				define void @copy-56-bytes-with-8-byte-registers() {
				; CHECK-LABEL: @copy-56-bytes-with-8-byte-registers
				; CHECK: ldp {{q[0-9]+}}
				; CHECK-NOT: {{adrp|add}}
				; CHECK: stp {{q[0-9]+}}
				; CHECK: ret
				entry:
				tail call void @llvm.memcpy.p0i8.p0i8.i64(i8* bitcast ([4 x i32]* @a to i8*), i8* bitcast ([4 x i32]* @b to i8*), i64 56, i32 8, i1 false)
				ret void
				}

				define void @copy-64-bytes-with-8-byte-registers() {
				; CHECK-LABEL: @copy-64-bytes-with-8-byte-registers
				; CHECK: adrp
				; CHECK: add
				; CHECK: adrp
				; CHECK: add
				; CHECK: ldp [[v1:q[0-9]+]], [[v2:q[0-9]+]]
				; CHECK: stp [[v1]], [[v2]]
				; CHECK: ldp [[v3:q[0-9]+]], [[v4:q[0-9]+]]
				; CHECK: stp [[v3]], [[v4]]
				; CHECK: ret
				entry:
				tail call void @llvm.memcpy.p0i8.p0i8.i64(i8* bitcast ([4 x i32]* @a to i8*), i8* bitcast ([4 x i32]* @b to i8*), i64 64, i32 8, i1 false)
				ret void
				}

				declare void @llvm.memcpy.p0i8.p0i8.i64(i8* nocapture, i8* nocapture readonly, i64, i32, i1)


Revision Contents

Diff 15881
