This is an archive of the discontinued LLVM Phabricator instance.

Differential D119006

[AMDGPU] SILoadStoreOptimizer: avoid unbounded register pressure increases
Closed, Public

Authored by foad on Feb 4 2022, 8:31 AM.


Details

Reviewers

piotr
critson
arsenm
rampitec
tstellar
nhaehnle

Commits

rG359a792f9b13: [AMDGPU] SILoadStoreOptimizer: avoid unbounded register pressure increases

Summary

Previously when combining two loads this pass would sink the
first one down to the second one, putting the combined load
where the second one was. It would also sink any intervening
instructions which depended on the first load down to just
after the combined load.

For example, if we started with this sequence of
instructions (code flowing from left to right):

X A B C D E F Y

After combining loads X and Y into XY we might end up with:

A B C D E F XY

But if B D and F depended on X, we would get:

A C E XY B D F

Now if the original code had some short disjoint live ranges
from A to B, C to D and E to F, in the transformed code
these live ranges will be long and overlapping. In this way
a single merge of two loads could cause an unbounded
increase in register pressure.
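As a rough illustration of the example above, the following sketch counts how many of the three short live ranges (A to B, C to D, E to F) overlap under each ordering. It is a toy model, not code from the patch: posOf and maxPressure are hypothetical helpers, and each live range is assumed to run from a single def to a single last use.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Index of instruction `name` within `order` (hypothetical helper).
static std::size_t posOf(const std::vector<std::string> &order,
                         const std::string &name) {
  auto it = std::find(order.begin(), order.end(), name);
  assert(it != order.end() && "instruction not in sequence");
  return static_cast<std::size_t>(it - order.begin());
}

// Maximum number of live ranges that overlap at any point between two
// adjacent instructions of `order`. Each range is (def, last use).
static int maxPressure(
    const std::vector<std::string> &order,
    const std::vector<std::pair<std::string, std::string>> &ranges) {
  int best = 0;
  for (std::size_t i = 0; i + 1 < order.size(); ++i) {
    int live = 0;
    for (const auto &r : ranges) {
      // The range is live after instruction i if it was defined at or
      // before i and its last use comes strictly later.
      if (posOf(order, r.first) <= i && i < posOf(order, r.second))
        ++live;
    }
    best = std::max(best, live);
  }
  return best;
}
```

With those three ranges, the original order X A B C D E F Y gives a maximum of 1, the old transformed order A C E XY B D F gives 3, and hoisting the second load instead (XY A B C D E F) stays at 1. Adding more def/use pairs between X and Y grows the old result without limit, which is the unbounded increase described above.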

To fix this, change the way that loads are moved in
order to merge them so that:

- The second load is moved up to the first one. (But when merging stores, we still move the first store down to the second one.)
- Intervening instructions are never moved.
- Instead, if we find an intervening instruction that would need to be moved, give up on the merge. This case should now be pretty rare, because normal stores have no outputs and normal loads only have address register inputs, which will be identical for any pair of loads that we try to merge.

As well as fixing the unbounded register pressure increase
problem, moving loads up and stores down seems like it
should usually be a win for memory latency reasons.
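The give-up rule in the summary can be sketched with a toy model. This is an assumption-laden illustration rather than the pass itself: ToyInst, canSwap and canHoistSecondLoad are hypothetical names, registers are plain integers, and the memory aliasing check that the real pass also performs is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Toy instruction: the registers it writes and the registers it reads.
struct ToyInst {
  std::set<int> defs;
  std::set<int> uses;
};

// True if A and B can trade places without changing any register
// dependence between them (aliasing is deliberately ignored here).
static bool canSwap(const ToyInst &a, const ToyInst &b) {
  for (int r : b.defs)
    if (a.defs.count(r) || a.uses.count(r))
      return false; // B overwrites something A writes or reads
  for (int r : b.uses)
    if (a.defs.count(r))
      return false; // B reads something A writes
  return true;
}

// True if block[second] can be hoisted up to block[first] while every
// instruction in between stays where it is. On failure the caller is
// expected to give up on the merge instead of moving anything else.
static bool canHoistSecondLoad(const std::vector<ToyInst> &block,
                               std::size_t first, std::size_t second) {
  assert(first < second && second < block.size());
  for (std::size_t i = first + 1; i < second; ++i)
    if (!canSwap(block[second], block[i]))
      return false;
  return true;
}
```

In this model an intervening instruction that uses the first load's result does not block the hoist, because it neither reads the second load's output nor writes its address register. Only an instruction that touches the second load's own registers forces the merge to be abandoned.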

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

foad created this revision. Feb 4 2022, 8:31 AM

Herald added subscribers: kerbowa, hiraditya, t-tye and 5 others. Feb 4 2022, 8:31 AM

foad requested review of this revision. Feb 4 2022, 8:31 AM

Herald added a project: Restricted Project. Feb 4 2022, 8:31 AM

Herald added subscribers: llvm-commits, wdng.

foad added a parent revision: D118994: [AMDGPU] SILoadStoreOptimizer: rewrite checkAndPrepareMerge. NFCI. Feb 4 2022, 8:32 AM

Harbormaster completed remote builds in B147639: Diff 405978. Feb 4 2022, 8:32 AM

Doesn't moving a load higher and a store lower increase register pressure itself?

In D119006#3297626, @rampitec wrote:

Doesn't moving a load higher and a store lower increase register pressure itself?

Yes, but in a very controlled way: exactly one live range gets lengthened. I'm trying to avoid the kind of crazy problems we've seen where merging one pair of loads increases the pressure by several hundred vgprs.

I statically compiled 10,000 graphics shaders to look at the effect of this patch on register pressure.
309 shaders had their overall vgpr count reduced, by an average of 2.9 vgprs.
310 shaders had their overall vgpr count increased, by an average of 2.5 vgprs.
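Taken together, the two averages above imply a small net saving. The figures below are only approximate, since the quoted averages are rounded:

```latex
\begin{align*}
\text{vgprs saved} &\approx 309 \times 2.9 \approx 896 \\
\text{vgprs added} &\approx 310 \times 2.5 = 775 \\
\text{net saving} &\approx 896 - 775 = 121 \text{ vgprs over } 10{,}000 \text{ shaders}
\end{align*}
```

That works out to roughly 0.012 vgprs per shader across the whole set.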

So that indicates an improvement in the average vgpr count.

Btw, does the patch result in a noticeably smaller number of merges on average?

arsenm accepted this revision. Feb 7 2022, 11:13 AM

arsenm added inline comments.

llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
604–605	Generally those don't exist at this point outside of copies

This revision is now accepted and ready to land. Feb 7 2022, 11:13 AM

In D119006#3301526, @piotr wrote:

So that indicates an improvement in the average vgpr count.

Yes.

Btw, does the patch result in a noticeably smaller number of merges on average?

No, to my surprise there was absolutely no difference in the amount of merging in any of the 10,000 shaders. I checked by diffing the instruction mix for each shader. The only differences were in VALU and SALU instructions.

llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
604–605	What don't? Subregs or partial physreg overlaps or both?

arsenm added inline comments. Feb 7 2022, 11:59 AM

llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
604–605	Both. In SSA those generally appear as plain copies to do the extract or copy from physreg

In D119006#3302156, @foad wrote:

In D119006#3301526, @piotr wrote:

So that indicates an improvement in the average vgpr count.

Yes.

Btw, does the patch result in a noticeably smaller number of merges on average?

No, to my surprise there was absolutely no difference in the amount of merging in any of the 10,000 shaders. I checked by diffing the instruction mix for each shader. The only differences were in VALU and SALU instructions.

Cool, that's even better - it means the improved vgpr count was not achieved at the expense of the actual merging.

Remove comments about subregs.

foad marked an inline comment as done. Feb 17 2022, 7:21 AM

foad added inline comments.

llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
604–605	OK, I just removed the comments.

Performance testing on graphics frames showed no nasty surprises (and no big improvements either).

Harbormaster completed remote builds in B150238: Diff 409647. Feb 17 2022, 8:03 AM

This revision was landed with ongoing or failed builds. Feb 21 2022, 2:51 AM

Closed by commit rG359a792f9b13: [AMDGPU] SILoadStoreOptimizer: avoid unbounded register pressure increases (authored by foad).

This revision was automatically updated to reflect the committed changes.

foad added a commit: rG359a792f9b13: [AMDGPU] SILoadStoreOptimizer: avoid unbounded register pressure increases.

Revision Contents

Path	Size
llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp	323 lines
llvm/test/CodeGen/AMDGPU/ds-combine-with-dependence.ll	8 lines
llvm/test/CodeGen/AMDGPU/ds_read2.ll	14 lines
llvm/test/CodeGen/AMDGPU/lower-lds-struct-aa.ll	16 lines
llvm/test/CodeGen/AMDGPU/merge-load-store-physreg.mir	2 lines
llvm/test/CodeGen/AMDGPU/merge-out-of-order-ldst.mir	3 lines
llvm/test/CodeGen/AMDGPU/merge-tbuffer.mir	8 lines
llvm/test/CodeGen/AMDGPU/si-triv-disjoint-mem-access.ll	2 lines

Diff 410261

llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp

[... 179 lines hidden ...]
private:		private:
const GCNSubtarget *STM = nullptr;		const GCNSubtarget *STM = nullptr;
const SIInstrInfo *TII = nullptr;		const SIInstrInfo *TII = nullptr;
const SIRegisterInfo *TRI = nullptr;		const SIRegisterInfo *TRI = nullptr;
MachineRegisterInfo *MRI = nullptr;		MachineRegisterInfo *MRI = nullptr;
AliasAnalysis *AA = nullptr;		AliasAnalysis *AA = nullptr;
bool OptimizeAgain;		bool OptimizeAgain;

		bool canSwapInstructions(const DenseSet<Register> &ARegDefs,
		const DenseSet<Register> &ARegUses,
		const MachineInstr &A, const MachineInstr &B) const;
static bool dmasksCanBeCombined(const CombineInfo &CI,		static bool dmasksCanBeCombined(const CombineInfo &CI,
const SIInstrInfo &TII,		const SIInstrInfo &TII,
const CombineInfo &Paired);		const CombineInfo &Paired);
static bool offsetsCanBeCombined(CombineInfo &CI, const GCNSubtarget &STI,		static bool offsetsCanBeCombined(CombineInfo &CI, const GCNSubtarget &STI,
CombineInfo &Paired, bool Modify = false);		CombineInfo &Paired, bool Modify = false);
static bool widthsFit(const GCNSubtarget &STI, const CombineInfo &CI,		static bool widthsFit(const GCNSubtarget &STI, const CombineInfo &CI,
const CombineInfo &Paired);		const CombineInfo &Paired);
static unsigned getNewOpcode(const CombineInfo &CI, const CombineInfo &Paired);		static unsigned getNewOpcode(const CombineInfo &CI, const CombineInfo &Paired);
static std::pair<unsigned, unsigned> getSubRegIdxs(const CombineInfo &CI,		static std::pair<unsigned, unsigned> getSubRegIdxs(const CombineInfo &CI,
const CombineInfo &Paired);		const CombineInfo &Paired);
const TargetRegisterClass *getTargetRegisterClass(const CombineInfo &CI,		const TargetRegisterClass *getTargetRegisterClass(const CombineInfo &CI,
const CombineInfo &Paired);		const CombineInfo &Paired);
const TargetRegisterClass *getDataRegClass(const MachineInstr &MI) const;		const TargetRegisterClass *getDataRegClass(const MachineInstr &MI) const;

bool checkAndPrepareMerge(CombineInfo &CI, CombineInfo &Paired,		CombineInfo *checkAndPrepareMerge(CombineInfo &CI, CombineInfo &Paired);
SmallVectorImpl<MachineInstr *> &InstsToMove);

unsigned read2Opcode(unsigned EltSize) const;		unsigned read2Opcode(unsigned EltSize) const;
unsigned read2ST64Opcode(unsigned EltSize) const;		unsigned read2ST64Opcode(unsigned EltSize) const;
MachineBasicBlock::iterator mergeRead2Pair(CombineInfo &CI,		MachineBasicBlock::iterator
CombineInfo &Paired,		mergeRead2Pair(CombineInfo &CI, CombineInfo &Paired,
const SmallVectorImpl<MachineInstr *> &InstsToMove);		MachineBasicBlock::iterator InsertBefore);

unsigned write2Opcode(unsigned EltSize) const;		unsigned write2Opcode(unsigned EltSize) const;
unsigned write2ST64Opcode(unsigned EltSize) const;		unsigned write2ST64Opcode(unsigned EltSize) const;
MachineBasicBlock::iterator		MachineBasicBlock::iterator
mergeWrite2Pair(CombineInfo &CI, CombineInfo &Paired,		mergeWrite2Pair(CombineInfo &CI, CombineInfo &Paired,
const SmallVectorImpl<MachineInstr *> &InstsToMove);		MachineBasicBlock::iterator InsertBefore);
MachineBasicBlock::iterator		MachineBasicBlock::iterator
mergeImagePair(CombineInfo &CI, CombineInfo &Paired,		mergeImagePair(CombineInfo &CI, CombineInfo &Paired,
const SmallVectorImpl<MachineInstr *> &InstsToMove);		MachineBasicBlock::iterator InsertBefore);
MachineBasicBlock::iterator		MachineBasicBlock::iterator
mergeSBufferLoadImmPair(CombineInfo &CI, CombineInfo &Paired,		mergeSBufferLoadImmPair(CombineInfo &CI, CombineInfo &Paired,
const SmallVectorImpl<MachineInstr *> &InstsToMove);		MachineBasicBlock::iterator InsertBefore);
MachineBasicBlock::iterator		MachineBasicBlock::iterator
mergeBufferLoadPair(CombineInfo &CI, CombineInfo &Paired,		mergeBufferLoadPair(CombineInfo &CI, CombineInfo &Paired,
const SmallVectorImpl<MachineInstr *> &InstsToMove);		MachineBasicBlock::iterator InsertBefore);
MachineBasicBlock::iterator		MachineBasicBlock::iterator
mergeBufferStorePair(CombineInfo &CI, CombineInfo &Paired,		mergeBufferStorePair(CombineInfo &CI, CombineInfo &Paired,
const SmallVectorImpl<MachineInstr *> &InstsToMove);		MachineBasicBlock::iterator InsertBefore);
MachineBasicBlock::iterator		MachineBasicBlock::iterator
mergeTBufferLoadPair(CombineInfo &CI, CombineInfo &Paired,		mergeTBufferLoadPair(CombineInfo &CI, CombineInfo &Paired,
const SmallVectorImpl<MachineInstr *> &InstsToMove);		MachineBasicBlock::iterator InsertBefore);
MachineBasicBlock::iterator		MachineBasicBlock::iterator
mergeTBufferStorePair(CombineInfo &CI, CombineInfo &Paired,		mergeTBufferStorePair(CombineInfo &CI, CombineInfo &Paired,
const SmallVectorImpl<MachineInstr *> &InstsToMove);		MachineBasicBlock::iterator InsertBefore);

void updateBaseAndOffset(MachineInstr &I, Register NewBase,		void updateBaseAndOffset(MachineInstr &I, Register NewBase,
int32_t NewOffset) const;		int32_t NewOffset) const;
Register computeBase(MachineInstr &MI, const MemAddress &Addr) const;		Register computeBase(MachineInstr &MI, const MemAddress &Addr) const;
MachineOperand createRegOrImm(int32_t Val, MachineInstr &MI) const;		MachineOperand createRegOrImm(int32_t Val, MachineInstr &MI) const;
Optional<int32_t> extractConstOffset(const MachineOperand &Op) const;		Optional<int32_t> extractConstOffset(const MachineOperand &Op) const;
void processBaseWithConstOffset(const MachineOperand &Base, MemAddress &Addr) const;		void processBaseWithConstOffset(const MachineOperand &Base, MemAddress &Addr) const;
/// Promotes constant offset to the immediate by adjusting the base. It		/// Promotes constant offset to the immediate by adjusting the base. It
[... 333 lines hidden ...]
char SILoadStoreOptimizer::ID = 0;		char SILoadStoreOptimizer::ID = 0;

char &llvm::SILoadStoreOptimizerID = SILoadStoreOptimizer::ID;		char &llvm::SILoadStoreOptimizerID = SILoadStoreOptimizer::ID;

FunctionPass *llvm::createSILoadStoreOptimizerPass() {		FunctionPass *llvm::createSILoadStoreOptimizerPass() {
return new SILoadStoreOptimizer();		return new SILoadStoreOptimizer();
}		}

static void moveInstsAfter(MachineBasicBlock::iterator I,
ArrayRef<MachineInstr *> InstsToMove) {
MachineBasicBlock *MBB = I->getParent();
++I;
for (MachineInstr *MI : InstsToMove) {
MI->removeFromParent();
MBB->insert(I, MI);
}
}

static void addDefsUsesToList(const MachineInstr &MI,		static void addDefsUsesToList(const MachineInstr &MI,
DenseSet<Register> &RegDefs,		DenseSet<Register> &RegDefs,
DenseSet<Register> &PhysRegUses) {		DenseSet<Register> &RegUses) {
for (const MachineOperand &Op : MI.operands()) {		for (const auto &Op : MI.operands()) {
if (Op.isReg()) {		if (!Op.isReg())
		continue;
if (Op.isDef())		if (Op.isDef())
RegDefs.insert(Op.getReg());		RegDefs.insert(Op.getReg());
else if (Op.readsReg() && Op.getReg().isPhysical())		if (Op.readsReg())
PhysRegUses.insert(Op.getReg());		RegUses.insert(Op.getReg());
}
}
}

static bool memAccessesCanBeReordered(MachineBasicBlock::iterator A,
MachineBasicBlock::iterator B,
AliasAnalysis *AA) {
// RAW or WAR - cannot reorder
// WAW - cannot reorder
// RAR - safe to reorder
return !(A->mayStore() || B->mayStore()) || !A->mayAlias(AA, *B, true);
}

// Add MI and its defs to the lists if MI reads one of the defs that are
// already in the list. Returns true in that case.
static bool addToListsIfDependent(MachineInstr &MI, DenseSet<Register> &RegDefs,
DenseSet<Register> &PhysRegUses,
SmallVectorImpl<MachineInstr *> &Insts) {
for (MachineOperand &Use : MI.operands()) {
// If one of the defs is read, then there is a use of Def between I and the
// instruction that I will potentially be merged with. We will need to move
// this instruction after the merged instructions.
//
// Similarly, if there is a def which is read by an instruction that is to
// be moved for merging, then we need to move the def-instruction as well.
// This can only happen for physical registers such as M0; virtual
// registers are in SSA form.
if (Use.isReg() && ((Use.readsReg() && RegDefs.count(Use.getReg())) ||
(Use.isDef() && RegDefs.count(Use.getReg())) ||
(Use.isDef() && Use.getReg().isPhysical() &&
PhysRegUses.count(Use.getReg())))) {
Insts.push_back(&MI);
addDefsUsesToList(MI, RegDefs, PhysRegUses);
return true;
}		}
}		}

		bool SILoadStoreOptimizer::canSwapInstructions(
		const DenseSet<Register> &ARegDefs, const DenseSet<Register> &ARegUses,
		const MachineInstr &A, const MachineInstr &B) const {
		if (A.mayLoadOrStore() && B.mayLoadOrStore() &&
		(A.mayStore() || B.mayStore()) && A.mayAlias(AA, B, true))
return false;		return false;
}		for (const auto &BOp : B.operands()) {
		if (!BOp.isReg())
		arsenm: Generally those don't exist at this point outside of copies
		foad: What don't? Subregs or partial physreg overlaps or both?
		arsenm: Both. In SSA those generally appear as plain copies to do the extract or copy from physreg
		foad: OK, I just removed the comments.
static bool canMoveInstsAcrossMemOp(MachineInstr &MemOp,
ArrayRef<MachineInstr *> InstsToMove,
AliasAnalysis *AA) {
assert(MemOp.mayLoadOrStore());

for (MachineInstr *InstToMove : InstsToMove) {
if (!InstToMove->mayLoadOrStore())
continue;		continue;
if (!memAccessesCanBeReordered(MemOp, *InstToMove, AA))		if ((BOp.isDef() || BOp.readsReg()) && ARegDefs.contains(BOp.getReg()))
		return false;
		if (BOp.isDef() && ARegUses.contains(BOp.getReg()))
return false;		return false;
}		}
return true;		return true;
}		}

// This function assumes that \p A and \p B have are identical except for		// This function assumes that \p A and \p B have are identical except for
// size and offset, and they reference adjacent memory.		// size and offset, and they reference adjacent memory.
static MachineMemOperand *combineKnownAdjacentMMOs(MachineFunction &MF,		static MachineMemOperand *combineKnownAdjacentMMOs(MachineFunction &MF,
[... 226 lines hidden ...]	if (const auto *Dst = TII->getNamedOperand(MI, AMDGPU::OpName::sdst)) {
return TRI->getRegClassForReg(*MRI, Dst->getReg());		return TRI->getRegClassForReg(*MRI, Dst->getReg());
}		}
if (const auto *Src = TII->getNamedOperand(MI, AMDGPU::OpName::sdata)) {		if (const auto *Src = TII->getNamedOperand(MI, AMDGPU::OpName::sdata)) {
return TRI->getRegClassForReg(*MRI, Src->getReg());		return TRI->getRegClassForReg(*MRI, Src->getReg());
}		}
return nullptr;		return nullptr;
}		}

/// This function assumes that CI comes before Paired in a basic block.		/// This function assumes that CI comes before Paired in a basic block. Return
bool SILoadStoreOptimizer::checkAndPrepareMerge(		/// an insertion point for the merged instruction or nullptr on failure.
CombineInfo &CI, CombineInfo &Paired,		SILoadStoreOptimizer::CombineInfo *
SmallVectorImpl<MachineInstr *> &InstsToMove) {		SILoadStoreOptimizer::checkAndPrepareMerge(CombineInfo &CI,
		CombineInfo &Paired) {
// If another instruction has already been merged into CI, it may now be a		// If another instruction has already been merged into CI, it may now be a
// type that we can't do any further merging into.		// type that we can't do any further merging into.
if (CI.InstClass == UNKNOWN || Paired.InstClass == UNKNOWN)		if (CI.InstClass == UNKNOWN || Paired.InstClass == UNKNOWN)
return false;		return nullptr;
assert(CI.InstClass == Paired.InstClass);		assert(CI.InstClass == Paired.InstClass);

if (getInstSubclass(CI.I->getOpcode(), *TII) !=		if (getInstSubclass(CI.I->getOpcode(), *TII) !=
getInstSubclass(Paired.I->getOpcode(), *TII))		getInstSubclass(Paired.I->getOpcode(), *TII))
return false;		return nullptr;

// Check both offsets (or masks for MIMG) can be combined and fit in the		// Check both offsets (or masks for MIMG) can be combined and fit in the
// reduced range.		// reduced range.
if (CI.InstClass == MIMG) {		if (CI.InstClass == MIMG) {
if (!dmasksCanBeCombined(CI, *TII, Paired))		if (!dmasksCanBeCombined(CI, *TII, Paired))
return false;		return nullptr;
} else {		} else {
if (!widthsFit(*STM, CI, Paired) || !offsetsCanBeCombined(CI, *STM, Paired))		if (!widthsFit(*STM, CI, Paired) || !offsetsCanBeCombined(CI, *STM, Paired))
return false;		return nullptr;
}		}

DenseSet<Register> RegDefsToMove;		DenseSet<Register> RegDefs;
DenseSet<Register> PhysRegUsesToMove;		DenseSet<Register> RegUses;
addDefsUsesToList(*CI.I, RegDefsToMove, PhysRegUsesToMove);		CombineInfo *Where;
		if (CI.I->mayLoad()) {
MachineBasicBlock::iterator MBBE = CI.I->getParent()->end();		// Try to hoist Paired up to CI.
for (MachineBasicBlock::iterator MBBI = CI.I; ++MBBI != Paired.I;) {		addDefsUsesToList(*Paired.I, RegDefs, RegUses);
if (MBBI == MBBE) {		for (MachineBasicBlock::iterator MBBI = Paired.I; --MBBI != CI.I;) {
// CombineInfo::Order is a hint on the instruction ordering within the		if (!canSwapInstructions(RegDefs, RegUses, Paired.I, MBBI))
// basic block. This hint suggests that CI precedes Paired, which is		return nullptr;
// true most of the time. However, moveInstsAfter() processing a
// previous list may have changed this order in a situation when it
// moves an instruction which exists in some other merge list.
// In this case it must be dependent.
return false;
}		}
		Where = &CI;
// Keep going as long as one of these conditions are met:		} else {
// 1. It is safe to move I down past MBBI.		// Try to sink CI down to Paired.
// 2. It is safe to move MBBI down past the instruction that I will		addDefsUsesToList(*CI.I, RegDefs, RegUses);
// be merged into.		for (MachineBasicBlock::iterator MBBI = CI.I; ++MBBI != Paired.I;) {
		if (!canSwapInstructions(RegDefs, RegUses, *CI.I, *MBBI))
if (MBBI->mayLoadOrStore() &&		return nullptr;
(!memAccessesCanBeReordered(CI.I, MBBI, AA) ||
!canMoveInstsAcrossMemOp(*MBBI, InstsToMove, AA))) {
// We fail condition #1, but we may still be able to satisfy condition
// #2. Add this instruction to the move list and then we will check
// if condition #2 holds once we have selected the matching instruction.
InstsToMove.push_back(&*MBBI);
addDefsUsesToList(*MBBI, RegDefsToMove, PhysRegUsesToMove);
continue;
}		}
		Where = &Paired;
// When we match I with another load/store instruction we will be moving I
// down to the location of the matched instruction any uses of I will need
// to be moved down as well.
addToListsIfDependent(*MBBI, RegDefsToMove, PhysRegUsesToMove, InstsToMove);
}		}

// If Paired depends on any of the instructions we plan to move, give up.
if (addToListsIfDependent(*Paired.I, RegDefsToMove, PhysRegUsesToMove,
InstsToMove))
return false;

// We need to go through the list of instructions that we plan to
// move and make sure they are all safe to move down past the merged
// instruction.
if (!canMoveInstsAcrossMemOp(*Paired.I, InstsToMove, AA))
return false;

// Call offsetsCanBeCombined with modify = true so that the offsets are		// Call offsetsCanBeCombined with modify = true so that the offsets are
// correct for the new instruction. This should return true, because		// correct for the new instruction. This should return true, because
// this function should only be called on CombineInfo objects that		// this function should only be called on CombineInfo objects that
// have already been confirmed to be mergeable.		// have already been confirmed to be mergeable.
if (CI.InstClass == DS_READ || CI.InstClass == DS_WRITE)		if (CI.InstClass == DS_READ || CI.InstClass == DS_WRITE)
offsetsCanBeCombined(CI, *STM, Paired, true);		offsetsCanBeCombined(CI, *STM, Paired, true);
return true;		return Where;
}		}

unsigned SILoadStoreOptimizer::read2Opcode(unsigned EltSize) const {		unsigned SILoadStoreOptimizer::read2Opcode(unsigned EltSize) const {
if (STM->ldsRequiresM0Init())		if (STM->ldsRequiresM0Init())
return (EltSize == 4) ? AMDGPU::DS_READ2_B32 : AMDGPU::DS_READ2_B64;		return (EltSize == 4) ? AMDGPU::DS_READ2_B32 : AMDGPU::DS_READ2_B64;
return (EltSize == 4) ? AMDGPU::DS_READ2_B32_gfx9 : AMDGPU::DS_READ2_B64_gfx9;		return (EltSize == 4) ? AMDGPU::DS_READ2_B32_gfx9 : AMDGPU::DS_READ2_B64_gfx9;
}		}

unsigned SILoadStoreOptimizer::read2ST64Opcode(unsigned EltSize) const {		unsigned SILoadStoreOptimizer::read2ST64Opcode(unsigned EltSize) const {
if (STM->ldsRequiresM0Init())		if (STM->ldsRequiresM0Init())
return (EltSize == 4) ? AMDGPU::DS_READ2ST64_B32 : AMDGPU::DS_READ2ST64_B64;		return (EltSize == 4) ? AMDGPU::DS_READ2ST64_B32 : AMDGPU::DS_READ2ST64_B64;

return (EltSize == 4) ? AMDGPU::DS_READ2ST64_B32_gfx9		return (EltSize == 4) ? AMDGPU::DS_READ2ST64_B32_gfx9
: AMDGPU::DS_READ2ST64_B64_gfx9;		: AMDGPU::DS_READ2ST64_B64_gfx9;
}		}

MachineBasicBlock::iterator		MachineBasicBlock::iterator
SILoadStoreOptimizer::mergeRead2Pair(CombineInfo &CI, CombineInfo &Paired,		SILoadStoreOptimizer::mergeRead2Pair(CombineInfo &CI, CombineInfo &Paired,
const SmallVectorImpl<MachineInstr *> &InstsToMove) {		MachineBasicBlock::iterator InsertBefore) {
MachineBasicBlock *MBB = CI.I->getParent();		MachineBasicBlock *MBB = CI.I->getParent();

// Be careful, since the addresses could be subregisters themselves in weird		// Be careful, since the addresses could be subregisters themselves in weird
// cases, like vectors of pointers.		// cases, like vectors of pointers.
const auto *AddrReg = TII->getNamedOperand(*CI.I, AMDGPU::OpName::addr);		const auto *AddrReg = TII->getNamedOperand(*CI.I, AMDGPU::OpName::addr);

const auto *Dest0 = TII->getNamedOperand(*CI.I, AMDGPU::OpName::vdst);		const auto *Dest0 = TII->getNamedOperand(*CI.I, AMDGPU::OpName::vdst);
const auto *Dest1 = TII->getNamedOperand(*Paired.I, AMDGPU::OpName::vdst);		const auto *Dest1 = TII->getNamedOperand(*Paired.I, AMDGPU::OpName::vdst);
[... 22 lines hidden in SILoadStoreOptimizer::mergeRead2Pair ...]

DebugLoc DL = CI.I->getDebugLoc();		DebugLoc DL = CI.I->getDebugLoc();

Register BaseReg = AddrReg->getReg();		Register BaseReg = AddrReg->getReg();
unsigned BaseSubReg = AddrReg->getSubReg();		unsigned BaseSubReg = AddrReg->getSubReg();
unsigned BaseRegFlags = 0;		unsigned BaseRegFlags = 0;
if (CI.BaseOff) {		if (CI.BaseOff) {
Register ImmReg = MRI->createVirtualRegister(&AMDGPU::SReg_32RegClass);		Register ImmReg = MRI->createVirtualRegister(&AMDGPU::SReg_32RegClass);
BuildMI(*MBB, Paired.I, DL, TII->get(AMDGPU::S_MOV_B32), ImmReg)		BuildMI(*MBB, InsertBefore, DL, TII->get(AMDGPU::S_MOV_B32), ImmReg)
.addImm(CI.BaseOff);		.addImm(CI.BaseOff);

BaseReg = MRI->createVirtualRegister(&AMDGPU::VGPR_32RegClass);		BaseReg = MRI->createVirtualRegister(&AMDGPU::VGPR_32RegClass);
BaseRegFlags = RegState::Kill;		BaseRegFlags = RegState::Kill;

TII->getAddNoCarry(*MBB, Paired.I, DL, BaseReg)		TII->getAddNoCarry(*MBB, InsertBefore, DL, BaseReg)
.addReg(ImmReg)		.addReg(ImmReg)
.addReg(AddrReg->getReg(), 0, BaseSubReg)		.addReg(AddrReg->getReg(), 0, BaseSubReg)
.addImm(0); // clamp bit		.addImm(0); // clamp bit
BaseSubReg = 0;		BaseSubReg = 0;
}		}

MachineInstrBuilder Read2 =		MachineInstrBuilder Read2 =
BuildMI(*MBB, Paired.I, DL, Read2Desc, DestReg)		BuildMI(*MBB, InsertBefore, DL, Read2Desc, DestReg)
.addReg(BaseReg, BaseRegFlags, BaseSubReg) // addr		.addReg(BaseReg, BaseRegFlags, BaseSubReg) // addr
.addImm(NewOffset0) // offset0		.addImm(NewOffset0) // offset0
.addImm(NewOffset1) // offset1		.addImm(NewOffset1) // offset1
.addImm(0) // gds		.addImm(0) // gds
.cloneMergedMemRefs({&*CI.I, &*Paired.I});		.cloneMergedMemRefs({&*CI.I, &*Paired.I});

(void)Read2;		(void)Read2;

const MCInstrDesc &CopyDesc = TII->get(TargetOpcode::COPY);		const MCInstrDesc &CopyDesc = TII->get(TargetOpcode::COPY);

// Copy to the old destination registers.		// Copy to the old destination registers.
BuildMI(*MBB, Paired.I, DL, CopyDesc)		BuildMI(*MBB, InsertBefore, DL, CopyDesc)
.add(*Dest0) // Copy to same destination including flags and sub reg.		.add(*Dest0) // Copy to same destination including flags and sub reg.
.addReg(DestReg, 0, SubRegIdx0);		.addReg(DestReg, 0, SubRegIdx0);
MachineInstr *Copy1 = BuildMI(*MBB, Paired.I, DL, CopyDesc)		BuildMI(*MBB, InsertBefore, DL, CopyDesc)
.add(*Dest1)		.add(*Dest1)
.addReg(DestReg, RegState::Kill, SubRegIdx1);		.addReg(DestReg, RegState::Kill, SubRegIdx1);

moveInstsAfter(Copy1, InstsToMove);

CI.I->eraseFromParent();		CI.I->eraseFromParent();
Paired.I->eraseFromParent();		Paired.I->eraseFromParent();

LLVM_DEBUG(dbgs() << "Inserted read2: " << *Read2 << '\n');		LLVM_DEBUG(dbgs() << "Inserted read2: " << *Read2 << '\n');
return Read2;		return Read2;
}		}

unsigned SILoadStoreOptimizer::write2Opcode(unsigned EltSize) const {		unsigned SILoadStoreOptimizer::write2Opcode(unsigned EltSize) const {
if (STM->ldsRequiresM0Init())		if (STM->ldsRequiresM0Init())
return (EltSize == 4) ? AMDGPU::DS_WRITE2_B32 : AMDGPU::DS_WRITE2_B64;		return (EltSize == 4) ? AMDGPU::DS_WRITE2_B32 : AMDGPU::DS_WRITE2_B64;
return (EltSize == 4) ? AMDGPU::DS_WRITE2_B32_gfx9		return (EltSize == 4) ? AMDGPU::DS_WRITE2_B32_gfx9
: AMDGPU::DS_WRITE2_B64_gfx9;		: AMDGPU::DS_WRITE2_B64_gfx9;
}		}

unsigned SILoadStoreOptimizer::write2ST64Opcode(unsigned EltSize) const {		unsigned SILoadStoreOptimizer::write2ST64Opcode(unsigned EltSize) const {
if (STM->ldsRequiresM0Init())		if (STM->ldsRequiresM0Init())
return (EltSize == 4) ? AMDGPU::DS_WRITE2ST64_B32		return (EltSize == 4) ? AMDGPU::DS_WRITE2ST64_B32
: AMDGPU::DS_WRITE2ST64_B64;		: AMDGPU::DS_WRITE2ST64_B64;

return (EltSize == 4) ? AMDGPU::DS_WRITE2ST64_B32_gfx9		return (EltSize == 4) ? AMDGPU::DS_WRITE2ST64_B32_gfx9
: AMDGPU::DS_WRITE2ST64_B64_gfx9;		: AMDGPU::DS_WRITE2ST64_B64_gfx9;
}		}

MachineBasicBlock::iterator		MachineBasicBlock::iterator SILoadStoreOptimizer::mergeWrite2Pair(
SILoadStoreOptimizer::mergeWrite2Pair(CombineInfo &CI, CombineInfo &Paired,		CombineInfo &CI, CombineInfo &Paired,
const SmallVectorImpl<MachineInstr *> &InstsToMove) {		MachineBasicBlock::iterator InsertBefore) {
MachineBasicBlock *MBB = CI.I->getParent();		MachineBasicBlock *MBB = CI.I->getParent();

// Be sure to use .addOperand(), and not .addReg() with these. We want to be		// Be sure to use .addOperand(), and not .addReg() with these. We want to be
// sure we preserve the subregister index and any register flags set on them.		// sure we preserve the subregister index and any register flags set on them.
const MachineOperand *AddrReg =		const MachineOperand *AddrReg =
TII->getNamedOperand(*CI.I, AMDGPU::OpName::addr);		TII->getNamedOperand(*CI.I, AMDGPU::OpName::addr);
const MachineOperand *Data0 =		const MachineOperand *Data0 =
TII->getNamedOperand(*CI.I, AMDGPU::OpName::data0);		TII->getNamedOperand(*CI.I, AMDGPU::OpName::data0);
[... 17 lines hidden in SILoadStoreOptimizer::mergeWrite2Pair ...]
const MCInstrDesc &Write2Desc = TII->get(Opc);		const MCInstrDesc &Write2Desc = TII->get(Opc);
DebugLoc DL = CI.I->getDebugLoc();		DebugLoc DL = CI.I->getDebugLoc();

Register BaseReg = AddrReg->getReg();		Register BaseReg = AddrReg->getReg();
unsigned BaseSubReg = AddrReg->getSubReg();		unsigned BaseSubReg = AddrReg->getSubReg();
unsigned BaseRegFlags = 0;		unsigned BaseRegFlags = 0;
if (CI.BaseOff) {		if (CI.BaseOff) {
Register ImmReg = MRI->createVirtualRegister(&AMDGPU::SReg_32RegClass);		Register ImmReg = MRI->createVirtualRegister(&AMDGPU::SReg_32RegClass);
BuildMI(*MBB, Paired.I, DL, TII->get(AMDGPU::S_MOV_B32), ImmReg)		BuildMI(*MBB, InsertBefore, DL, TII->get(AMDGPU::S_MOV_B32), ImmReg)
.addImm(CI.BaseOff);		.addImm(CI.BaseOff);

BaseReg = MRI->createVirtualRegister(&AMDGPU::VGPR_32RegClass);		BaseReg = MRI->createVirtualRegister(&AMDGPU::VGPR_32RegClass);
BaseRegFlags = RegState::Kill;		BaseRegFlags = RegState::Kill;

TII->getAddNoCarry(*MBB, Paired.I, DL, BaseReg)		TII->getAddNoCarry(*MBB, InsertBefore, DL, BaseReg)
.addReg(ImmReg)		.addReg(ImmReg)
.addReg(AddrReg->getReg(), 0, BaseSubReg)		.addReg(AddrReg->getReg(), 0, BaseSubReg)
.addImm(0); // clamp bit		.addImm(0); // clamp bit
BaseSubReg = 0;		BaseSubReg = 0;
}		}

MachineInstrBuilder Write2 =		MachineInstrBuilder Write2 =
BuildMI(*MBB, Paired.I, DL, Write2Desc)		BuildMI(*MBB, InsertBefore, DL, Write2Desc)
.addReg(BaseReg, BaseRegFlags, BaseSubReg) // addr		.addReg(BaseReg, BaseRegFlags, BaseSubReg) // addr
.add(*Data0) // data0		.add(*Data0) // data0
.add(*Data1) // data1		.add(*Data1) // data1
.addImm(NewOffset0) // offset0		.addImm(NewOffset0) // offset0
.addImm(NewOffset1) // offset1		.addImm(NewOffset1) // offset1
.addImm(0) // gds		.addImm(0) // gds
.cloneMergedMemRefs({&*CI.I, &*Paired.I});		.cloneMergedMemRefs({&*CI.I, &*Paired.I});

moveInstsAfter(Write2, InstsToMove);

CI.I->eraseFromParent();		CI.I->eraseFromParent();
Paired.I->eraseFromParent();		Paired.I->eraseFromParent();

LLVM_DEBUG(dbgs() << "Inserted write2 inst: " << *Write2 << '\n');		LLVM_DEBUG(dbgs() << "Inserted write2 inst: " << *Write2 << '\n');
return Write2;		return Write2;
}		}

MachineBasicBlock::iterator		MachineBasicBlock::iterator
SILoadStoreOptimizer::mergeImagePair(CombineInfo &CI, CombineInfo &Paired,		SILoadStoreOptimizer::mergeImagePair(CombineInfo &CI, CombineInfo &Paired,
const SmallVectorImpl<MachineInstr *> &InstsToMove) {		MachineBasicBlock::iterator InsertBefore) {
MachineBasicBlock *MBB = CI.I->getParent();		MachineBasicBlock *MBB = CI.I->getParent();
DebugLoc DL = CI.I->getDebugLoc();		DebugLoc DL = CI.I->getDebugLoc();
const unsigned Opcode = getNewOpcode(CI, Paired);		const unsigned Opcode = getNewOpcode(CI, Paired);

const TargetRegisterClass *SuperRC = getTargetRegisterClass(CI, Paired);		const TargetRegisterClass *SuperRC = getTargetRegisterClass(CI, Paired);

Register DestReg = MRI->createVirtualRegister(SuperRC);		Register DestReg = MRI->createVirtualRegister(SuperRC);
unsigned MergedDMask = CI.DMask | Paired.DMask;		unsigned MergedDMask = CI.DMask | Paired.DMask;
unsigned DMaskIdx =		unsigned DMaskIdx =
AMDGPU::getNamedOperandIdx(CI.I->getOpcode(), AMDGPU::OpName::dmask);		AMDGPU::getNamedOperandIdx(CI.I->getOpcode(), AMDGPU::OpName::dmask);

auto MIB = BuildMI(*MBB, Paired.I, DL, TII->get(Opcode), DestReg);		auto MIB = BuildMI(*MBB, InsertBefore, DL, TII->get(Opcode), DestReg);
for (unsigned I = 1, E = (*CI.I).getNumOperands(); I != E; ++I) {		for (unsigned I = 1, E = (*CI.I).getNumOperands(); I != E; ++I) {
if (I == DMaskIdx)		if (I == DMaskIdx)
MIB.addImm(MergedDMask);		MIB.addImm(MergedDMask);
else		else
MIB.add((*CI.I).getOperand(I));		MIB.add((*CI.I).getOperand(I));
}		}

// It shouldn't be possible to get this far if the two instructions		// It shouldn't be possible to get this far if the two instructions
Show All 9 Lines	SILoadStoreOptimizer::mergeImagePair(CombineInfo &CI, CombineInfo &Paired,
unsigned SubRegIdx0, SubRegIdx1;		unsigned SubRegIdx0, SubRegIdx1;
std::tie(SubRegIdx0, SubRegIdx1) = getSubRegIdxs(CI, Paired);		std::tie(SubRegIdx0, SubRegIdx1) = getSubRegIdxs(CI, Paired);

// Copy to the old destination registers.		// Copy to the old destination registers.
const MCInstrDesc &CopyDesc = TII->get(TargetOpcode::COPY);		const MCInstrDesc &CopyDesc = TII->get(TargetOpcode::COPY);
const auto *Dest0 = TII->getNamedOperand(*CI.I, AMDGPU::OpName::vdata);		const auto *Dest0 = TII->getNamedOperand(*CI.I, AMDGPU::OpName::vdata);
const auto *Dest1 = TII->getNamedOperand(*Paired.I, AMDGPU::OpName::vdata);		const auto *Dest1 = TII->getNamedOperand(*Paired.I, AMDGPU::OpName::vdata);

BuildMI(*MBB, Paired.I, DL, CopyDesc)		BuildMI(*MBB, InsertBefore, DL, CopyDesc)
.add(*Dest0) // Copy to same destination including flags and sub reg.		.add(*Dest0) // Copy to same destination including flags and sub reg.
.addReg(DestReg, 0, SubRegIdx0);		.addReg(DestReg, 0, SubRegIdx0);
MachineInstr *Copy1 = BuildMI(*MBB, Paired.I, DL, CopyDesc)		BuildMI(*MBB, InsertBefore, DL, CopyDesc)
.add(*Dest1)		.add(*Dest1)
.addReg(DestReg, RegState::Kill, SubRegIdx1);		.addReg(DestReg, RegState::Kill, SubRegIdx1);

moveInstsAfter(Copy1, InstsToMove);

CI.I->eraseFromParent();		CI.I->eraseFromParent();
Paired.I->eraseFromParent();		Paired.I->eraseFromParent();
return New;		return New;
}		}

MachineBasicBlock::iterator SILoadStoreOptimizer::mergeSBufferLoadImmPair(		MachineBasicBlock::iterator SILoadStoreOptimizer::mergeSBufferLoadImmPair(
CombineInfo &CI, CombineInfo &Paired,		CombineInfo &CI, CombineInfo &Paired,
const SmallVectorImpl<MachineInstr *> &InstsToMove) {		MachineBasicBlock::iterator InsertBefore) {
MachineBasicBlock *MBB = CI.I->getParent();		MachineBasicBlock *MBB = CI.I->getParent();
DebugLoc DL = CI.I->getDebugLoc();		DebugLoc DL = CI.I->getDebugLoc();
const unsigned Opcode = getNewOpcode(CI, Paired);		const unsigned Opcode = getNewOpcode(CI, Paired);

const TargetRegisterClass *SuperRC = getTargetRegisterClass(CI, Paired);		const TargetRegisterClass *SuperRC = getTargetRegisterClass(CI, Paired);

Register DestReg = MRI->createVirtualRegister(SuperRC);		Register DestReg = MRI->createVirtualRegister(SuperRC);
unsigned MergedOffset = std::min(CI.Offset, Paired.Offset);		unsigned MergedOffset = std::min(CI.Offset, Paired.Offset);

// It shouldn't be possible to get this far if the two instructions		// It shouldn't be possible to get this far if the two instructions
// don't have a single memoperand, because MachineInstr::mayAlias()		// don't have a single memoperand, because MachineInstr::mayAlias()
// will return true if this is the case.		// will return true if this is the case.
assert(CI.I->hasOneMemOperand() && Paired.I->hasOneMemOperand());		assert(CI.I->hasOneMemOperand() && Paired.I->hasOneMemOperand());

const MachineMemOperand *MMOa = *CI.I->memoperands_begin();		const MachineMemOperand *MMOa = *CI.I->memoperands_begin();
const MachineMemOperand *MMOb = *Paired.I->memoperands_begin();		const MachineMemOperand *MMOb = *Paired.I->memoperands_begin();

MachineInstr *New =		MachineInstr *New =
BuildMI(*MBB, Paired.I, DL, TII->get(Opcode), DestReg)		BuildMI(*MBB, InsertBefore, DL, TII->get(Opcode), DestReg)
.add(*TII->getNamedOperand(*CI.I, AMDGPU::OpName::sbase))		.add(*TII->getNamedOperand(*CI.I, AMDGPU::OpName::sbase))
.addImm(MergedOffset) // offset		.addImm(MergedOffset) // offset
.addImm(CI.CPol) // cpol		.addImm(CI.CPol) // cpol
.addMemOperand(combineKnownAdjacentMMOs(*MBB->getParent(), MMOa, MMOb));		.addMemOperand(
		combineKnownAdjacentMMOs(*MBB->getParent(), MMOa, MMOb));

std::pair<unsigned, unsigned> SubRegIdx = getSubRegIdxs(CI, Paired);		std::pair<unsigned, unsigned> SubRegIdx = getSubRegIdxs(CI, Paired);
const unsigned SubRegIdx0 = std::get<0>(SubRegIdx);		const unsigned SubRegIdx0 = std::get<0>(SubRegIdx);
const unsigned SubRegIdx1 = std::get<1>(SubRegIdx);		const unsigned SubRegIdx1 = std::get<1>(SubRegIdx);

// Copy to the old destination registers.		// Copy to the old destination registers.
const MCInstrDesc &CopyDesc = TII->get(TargetOpcode::COPY);		const MCInstrDesc &CopyDesc = TII->get(TargetOpcode::COPY);
const auto *Dest0 = TII->getNamedOperand(*CI.I, AMDGPU::OpName::sdst);		const auto *Dest0 = TII->getNamedOperand(*CI.I, AMDGPU::OpName::sdst);
const auto *Dest1 = TII->getNamedOperand(*Paired.I, AMDGPU::OpName::sdst);		const auto *Dest1 = TII->getNamedOperand(*Paired.I, AMDGPU::OpName::sdst);

BuildMI(*MBB, Paired.I, DL, CopyDesc)		BuildMI(*MBB, InsertBefore, DL, CopyDesc)
.add(*Dest0) // Copy to same destination including flags and sub reg.		.add(*Dest0) // Copy to same destination including flags and sub reg.
.addReg(DestReg, 0, SubRegIdx0);		.addReg(DestReg, 0, SubRegIdx0);
MachineInstr *Copy1 = BuildMI(*MBB, Paired.I, DL, CopyDesc)		BuildMI(*MBB, InsertBefore, DL, CopyDesc)
.add(*Dest1)		.add(*Dest1)
.addReg(DestReg, RegState::Kill, SubRegIdx1);		.addReg(DestReg, RegState::Kill, SubRegIdx1);

moveInstsAfter(Copy1, InstsToMove);

CI.I->eraseFromParent();		CI.I->eraseFromParent();
Paired.I->eraseFromParent();		Paired.I->eraseFromParent();
return New;		return New;
}		}

MachineBasicBlock::iterator SILoadStoreOptimizer::mergeBufferLoadPair(		MachineBasicBlock::iterator SILoadStoreOptimizer::mergeBufferLoadPair(
CombineInfo &CI, CombineInfo &Paired,		CombineInfo &CI, CombineInfo &Paired,
const SmallVectorImpl<MachineInstr *> &InstsToMove) {		MachineBasicBlock::iterator InsertBefore) {
MachineBasicBlock *MBB = CI.I->getParent();		MachineBasicBlock *MBB = CI.I->getParent();
DebugLoc DL = CI.I->getDebugLoc();		DebugLoc DL = CI.I->getDebugLoc();

const unsigned Opcode = getNewOpcode(CI, Paired);		const unsigned Opcode = getNewOpcode(CI, Paired);

const TargetRegisterClass *SuperRC = getTargetRegisterClass(CI, Paired);		const TargetRegisterClass *SuperRC = getTargetRegisterClass(CI, Paired);

// Copy to the new source register.		// Copy to the new source register.
Register DestReg = MRI->createVirtualRegister(SuperRC);		Register DestReg = MRI->createVirtualRegister(SuperRC);
unsigned MergedOffset = std::min(CI.Offset, Paired.Offset);		unsigned MergedOffset = std::min(CI.Offset, Paired.Offset);

auto MIB = BuildMI(*MBB, Paired.I, DL, TII->get(Opcode), DestReg);		auto MIB = BuildMI(*MBB, InsertBefore, DL, TII->get(Opcode), DestReg);

AddressRegs Regs = getRegs(Opcode, *TII);		AddressRegs Regs = getRegs(Opcode, *TII);

if (Regs.VAddr)		if (Regs.VAddr)
MIB.add(*TII->getNamedOperand(*CI.I, AMDGPU::OpName::vaddr));		MIB.add(*TII->getNamedOperand(*CI.I, AMDGPU::OpName::vaddr));

// It shouldn't be possible to get this far if the two instructions		// It shouldn't be possible to get this far if the two instructions
// don't have a single memoperand, because MachineInstr::mayAlias()		// don't have a single memoperand, because MachineInstr::mayAlias()
Show All 16 Lines	MachineBasicBlock::iterator SILoadStoreOptimizer::mergeBufferLoadPair(
const unsigned SubRegIdx0 = std::get<0>(SubRegIdx);		const unsigned SubRegIdx0 = std::get<0>(SubRegIdx);
const unsigned SubRegIdx1 = std::get<1>(SubRegIdx);		const unsigned SubRegIdx1 = std::get<1>(SubRegIdx);

// Copy to the old destination registers.		// Copy to the old destination registers.
const MCInstrDesc &CopyDesc = TII->get(TargetOpcode::COPY);		const MCInstrDesc &CopyDesc = TII->get(TargetOpcode::COPY);
const auto *Dest0 = TII->getNamedOperand(*CI.I, AMDGPU::OpName::vdata);		const auto *Dest0 = TII->getNamedOperand(*CI.I, AMDGPU::OpName::vdata);
const auto *Dest1 = TII->getNamedOperand(*Paired.I, AMDGPU::OpName::vdata);		const auto *Dest1 = TII->getNamedOperand(*Paired.I, AMDGPU::OpName::vdata);

BuildMI(*MBB, Paired.I, DL, CopyDesc)		BuildMI(*MBB, InsertBefore, DL, CopyDesc)
.add(*Dest0) // Copy to same destination including flags and sub reg.		.add(*Dest0) // Copy to same destination including flags and sub reg.
.addReg(DestReg, 0, SubRegIdx0);		.addReg(DestReg, 0, SubRegIdx0);
MachineInstr *Copy1 = BuildMI(*MBB, Paired.I, DL, CopyDesc)		BuildMI(*MBB, InsertBefore, DL, CopyDesc)
.add(*Dest1)		.add(*Dest1)
.addReg(DestReg, RegState::Kill, SubRegIdx1);		.addReg(DestReg, RegState::Kill, SubRegIdx1);

moveInstsAfter(Copy1, InstsToMove);

CI.I->eraseFromParent();		CI.I->eraseFromParent();
Paired.I->eraseFromParent();		Paired.I->eraseFromParent();
return New;		return New;
}		}

MachineBasicBlock::iterator SILoadStoreOptimizer::mergeTBufferLoadPair(		MachineBasicBlock::iterator SILoadStoreOptimizer::mergeTBufferLoadPair(
CombineInfo &CI, CombineInfo &Paired,		CombineInfo &CI, CombineInfo &Paired,
const SmallVectorImpl<MachineInstr *> &InstsToMove) {		MachineBasicBlock::iterator InsertBefore) {
MachineBasicBlock *MBB = CI.I->getParent();		MachineBasicBlock *MBB = CI.I->getParent();
DebugLoc DL = CI.I->getDebugLoc();		DebugLoc DL = CI.I->getDebugLoc();

const unsigned Opcode = getNewOpcode(CI, Paired);		const unsigned Opcode = getNewOpcode(CI, Paired);

const TargetRegisterClass *SuperRC = getTargetRegisterClass(CI, Paired);		const TargetRegisterClass *SuperRC = getTargetRegisterClass(CI, Paired);

// Copy to the new source register.		// Copy to the new source register.
Register DestReg = MRI->createVirtualRegister(SuperRC);		Register DestReg = MRI->createVirtualRegister(SuperRC);
unsigned MergedOffset = std::min(CI.Offset, Paired.Offset);		unsigned MergedOffset = std::min(CI.Offset, Paired.Offset);

auto MIB = BuildMI(*MBB, Paired.I, DL, TII->get(Opcode), DestReg);		auto MIB = BuildMI(*MBB, InsertBefore, DL, TII->get(Opcode), DestReg);

AddressRegs Regs = getRegs(Opcode, *TII);		AddressRegs Regs = getRegs(Opcode, *TII);

if (Regs.VAddr)		if (Regs.VAddr)
MIB.add(*TII->getNamedOperand(*CI.I, AMDGPU::OpName::vaddr));		MIB.add(*TII->getNamedOperand(*CI.I, AMDGPU::OpName::vaddr));

unsigned JoinedFormat =		unsigned JoinedFormat =
getBufferFormatWithCompCount(CI.Format, CI.Width + Paired.Width, *STM);		getBufferFormatWithCompCount(CI.Format, CI.Width + Paired.Width, *STM);
Show All 21 Lines	MachineBasicBlock::iterator SILoadStoreOptimizer::mergeTBufferLoadPair(
const unsigned SubRegIdx0 = std::get<0>(SubRegIdx);		const unsigned SubRegIdx0 = std::get<0>(SubRegIdx);
const unsigned SubRegIdx1 = std::get<1>(SubRegIdx);		const unsigned SubRegIdx1 = std::get<1>(SubRegIdx);

// Copy to the old destination registers.		// Copy to the old destination registers.
const MCInstrDesc &CopyDesc = TII->get(TargetOpcode::COPY);		const MCInstrDesc &CopyDesc = TII->get(TargetOpcode::COPY);
const auto *Dest0 = TII->getNamedOperand(*CI.I, AMDGPU::OpName::vdata);		const auto *Dest0 = TII->getNamedOperand(*CI.I, AMDGPU::OpName::vdata);
const auto *Dest1 = TII->getNamedOperand(*Paired.I, AMDGPU::OpName::vdata);		const auto *Dest1 = TII->getNamedOperand(*Paired.I, AMDGPU::OpName::vdata);

BuildMI(*MBB, Paired.I, DL, CopyDesc)		BuildMI(*MBB, InsertBefore, DL, CopyDesc)
.add(*Dest0) // Copy to same destination including flags and sub reg.		.add(*Dest0) // Copy to same destination including flags and sub reg.
.addReg(DestReg, 0, SubRegIdx0);		.addReg(DestReg, 0, SubRegIdx0);
MachineInstr *Copy1 = BuildMI(*MBB, Paired.I, DL, CopyDesc)		BuildMI(*MBB, InsertBefore, DL, CopyDesc)
.add(*Dest1)		.add(*Dest1)
.addReg(DestReg, RegState::Kill, SubRegIdx1);		.addReg(DestReg, RegState::Kill, SubRegIdx1);

moveInstsAfter(Copy1, InstsToMove);

CI.I->eraseFromParent();		CI.I->eraseFromParent();
Paired.I->eraseFromParent();		Paired.I->eraseFromParent();
return New;		return New;
}		}

MachineBasicBlock::iterator SILoadStoreOptimizer::mergeTBufferStorePair(		MachineBasicBlock::iterator SILoadStoreOptimizer::mergeTBufferStorePair(
CombineInfo &CI, CombineInfo &Paired,		CombineInfo &CI, CombineInfo &Paired,
const SmallVectorImpl<MachineInstr *> &InstsToMove) {		MachineBasicBlock::iterator InsertBefore) {
MachineBasicBlock *MBB = CI.I->getParent();		MachineBasicBlock *MBB = CI.I->getParent();
DebugLoc DL = CI.I->getDebugLoc();		DebugLoc DL = CI.I->getDebugLoc();

const unsigned Opcode = getNewOpcode(CI, Paired);		const unsigned Opcode = getNewOpcode(CI, Paired);

std::pair<unsigned, unsigned> SubRegIdx = getSubRegIdxs(CI, Paired);		std::pair<unsigned, unsigned> SubRegIdx = getSubRegIdxs(CI, Paired);
const unsigned SubRegIdx0 = std::get<0>(SubRegIdx);		const unsigned SubRegIdx0 = std::get<0>(SubRegIdx);
const unsigned SubRegIdx1 = std::get<1>(SubRegIdx);		const unsigned SubRegIdx1 = std::get<1>(SubRegIdx);

// Copy to the new source register.		// Copy to the new source register.
const TargetRegisterClass *SuperRC = getTargetRegisterClass(CI, Paired);		const TargetRegisterClass *SuperRC = getTargetRegisterClass(CI, Paired);
Register SrcReg = MRI->createVirtualRegister(SuperRC);		Register SrcReg = MRI->createVirtualRegister(SuperRC);

const auto *Src0 = TII->getNamedOperand(*CI.I, AMDGPU::OpName::vdata);		const auto *Src0 = TII->getNamedOperand(*CI.I, AMDGPU::OpName::vdata);
const auto *Src1 = TII->getNamedOperand(*Paired.I, AMDGPU::OpName::vdata);		const auto *Src1 = TII->getNamedOperand(*Paired.I, AMDGPU::OpName::vdata);

BuildMI(*MBB, Paired.I, DL, TII->get(AMDGPU::REG_SEQUENCE), SrcReg)		BuildMI(*MBB, InsertBefore, DL, TII->get(AMDGPU::REG_SEQUENCE), SrcReg)
.add(*Src0)		.add(*Src0)
.addImm(SubRegIdx0)		.addImm(SubRegIdx0)
.add(*Src1)		.add(*Src1)
.addImm(SubRegIdx1);		.addImm(SubRegIdx1);

auto MIB = BuildMI(*MBB, Paired.I, DL, TII->get(Opcode))		auto MIB = BuildMI(*MBB, InsertBefore, DL, TII->get(Opcode))
.addReg(SrcReg, RegState::Kill);		.addReg(SrcReg, RegState::Kill);

AddressRegs Regs = getRegs(Opcode, *TII);		AddressRegs Regs = getRegs(Opcode, *TII);

if (Regs.VAddr)		if (Regs.VAddr)
MIB.add(*TII->getNamedOperand(*CI.I, AMDGPU::OpName::vaddr));		MIB.add(*TII->getNamedOperand(*CI.I, AMDGPU::OpName::vaddr));

unsigned JoinedFormat =		unsigned JoinedFormat =
Show All 13 Lines	MachineInstr *New =
.addImm(std::min(CI.Offset, Paired.Offset)) // offset		.addImm(std::min(CI.Offset, Paired.Offset)) // offset
.addImm(JoinedFormat) // format		.addImm(JoinedFormat) // format
.addImm(CI.CPol) // cpol		.addImm(CI.CPol) // cpol
.addImm(0) // tfe		.addImm(0) // tfe
.addImm(0) // swz		.addImm(0) // swz
.addMemOperand(		.addMemOperand(
combineKnownAdjacentMMOs(*MBB->getParent(), MMOa, MMOb));		combineKnownAdjacentMMOs(*MBB->getParent(), MMOa, MMOb));

moveInstsAfter(MIB, InstsToMove);

CI.I->eraseFromParent();		CI.I->eraseFromParent();
Paired.I->eraseFromParent();		Paired.I->eraseFromParent();
return New;		return New;
}		}

unsigned SILoadStoreOptimizer::getNewOpcode(const CombineInfo &CI,		unsigned SILoadStoreOptimizer::getNewOpcode(const CombineInfo &CI,
const CombineInfo &Paired) {		const CombineInfo &Paired) {
const unsigned Width = CI.Width + Paired.Width;		const unsigned Width = CI.Width + Paired.Width;
▲ Show 20 Lines • Show All 88 Lines • ▼ Show 20 Lines	SILoadStoreOptimizer::getTargetRegisterClass(const CombineInfo &CI,
unsigned BitWidth = 32 * (CI.Width + Paired.Width);		unsigned BitWidth = 32 * (CI.Width + Paired.Width);
return TRI->isAGPRClass(getDataRegClass(*CI.I))		return TRI->isAGPRClass(getDataRegClass(*CI.I))
? TRI->getAGPRClassForBitWidth(BitWidth)		? TRI->getAGPRClassForBitWidth(BitWidth)
: TRI->getVGPRClassForBitWidth(BitWidth);		: TRI->getVGPRClassForBitWidth(BitWidth);
}		}

MachineBasicBlock::iterator SILoadStoreOptimizer::mergeBufferStorePair(		MachineBasicBlock::iterator SILoadStoreOptimizer::mergeBufferStorePair(
CombineInfo &CI, CombineInfo &Paired,		CombineInfo &CI, CombineInfo &Paired,
const SmallVectorImpl<MachineInstr *> &InstsToMove) {		MachineBasicBlock::iterator InsertBefore) {
MachineBasicBlock *MBB = CI.I->getParent();		MachineBasicBlock *MBB = CI.I->getParent();
DebugLoc DL = CI.I->getDebugLoc();		DebugLoc DL = CI.I->getDebugLoc();

const unsigned Opcode = getNewOpcode(CI, Paired);		const unsigned Opcode = getNewOpcode(CI, Paired);

std::pair<unsigned, unsigned> SubRegIdx = getSubRegIdxs(CI, Paired);		std::pair<unsigned, unsigned> SubRegIdx = getSubRegIdxs(CI, Paired);
const unsigned SubRegIdx0 = std::get<0>(SubRegIdx);		const unsigned SubRegIdx0 = std::get<0>(SubRegIdx);
const unsigned SubRegIdx1 = std::get<1>(SubRegIdx);		const unsigned SubRegIdx1 = std::get<1>(SubRegIdx);

// Copy to the new source register.		// Copy to the new source register.
const TargetRegisterClass *SuperRC = getTargetRegisterClass(CI, Paired);		const TargetRegisterClass *SuperRC = getTargetRegisterClass(CI, Paired);
Register SrcReg = MRI->createVirtualRegister(SuperRC);		Register SrcReg = MRI->createVirtualRegister(SuperRC);

const auto *Src0 = TII->getNamedOperand(*CI.I, AMDGPU::OpName::vdata);		const auto *Src0 = TII->getNamedOperand(*CI.I, AMDGPU::OpName::vdata);
const auto *Src1 = TII->getNamedOperand(*Paired.I, AMDGPU::OpName::vdata);		const auto *Src1 = TII->getNamedOperand(*Paired.I, AMDGPU::OpName::vdata);

BuildMI(*MBB, Paired.I, DL, TII->get(AMDGPU::REG_SEQUENCE), SrcReg)		BuildMI(*MBB, InsertBefore, DL, TII->get(AMDGPU::REG_SEQUENCE), SrcReg)
.add(*Src0)		.add(*Src0)
.addImm(SubRegIdx0)		.addImm(SubRegIdx0)
.add(*Src1)		.add(*Src1)
.addImm(SubRegIdx1);		.addImm(SubRegIdx1);

auto MIB = BuildMI(*MBB, Paired.I, DL, TII->get(Opcode))		auto MIB = BuildMI(*MBB, InsertBefore, DL, TII->get(Opcode))
.addReg(SrcReg, RegState::Kill);		.addReg(SrcReg, RegState::Kill);

AddressRegs Regs = getRegs(Opcode, *TII);		AddressRegs Regs = getRegs(Opcode, *TII);

if (Regs.VAddr)		if (Regs.VAddr)
MIB.add(*TII->getNamedOperand(*CI.I, AMDGPU::OpName::vaddr));		MIB.add(*TII->getNamedOperand(*CI.I, AMDGPU::OpName::vaddr));


Show All 9 Lines	MachineInstr *New =
MIB.add(*TII->getNamedOperand(*CI.I, AMDGPU::OpName::srsrc))		MIB.add(*TII->getNamedOperand(*CI.I, AMDGPU::OpName::srsrc))
.add(*TII->getNamedOperand(*CI.I, AMDGPU::OpName::soffset))		.add(*TII->getNamedOperand(*CI.I, AMDGPU::OpName::soffset))
.addImm(std::min(CI.Offset, Paired.Offset)) // offset		.addImm(std::min(CI.Offset, Paired.Offset)) // offset
.addImm(CI.CPol) // cpol		.addImm(CI.CPol) // cpol
.addImm(0) // tfe		.addImm(0) // tfe
.addImm(0) // swz		.addImm(0) // swz
.addMemOperand(combineKnownAdjacentMMOs(*MBB->getParent(), MMOa, MMOb));		.addMemOperand(combineKnownAdjacentMMOs(*MBB->getParent(), MMOa, MMOb));

moveInstsAfter(MIB, InstsToMove);

CI.I->eraseFromParent();		CI.I->eraseFromParent();
Paired.I->eraseFromParent();		Paired.I->eraseFromParent();
return New;		return New;
}		}

MachineOperand		MachineOperand
SILoadStoreOptimizer::createRegOrImm(int32_t Val, MachineInstr &MI) const {		SILoadStoreOptimizer::createRegOrImm(int32_t Val, MachineInstr &MI) const {
APInt V(32, Val, true);		APInt V(32, Val, true);
▲ Show 20 Lines • Show All 462 Lines • ▼ Show 20 Lines	for (auto I = MergeList.begin(), Next = std::next(I); Next != MergeList.end();
auto First = I;		auto First = I;
auto Second = Next;		auto Second = Next;

if ((*First).Order > (*Second).Order)		if ((*First).Order > (*Second).Order)
std::swap(First, Second);		std::swap(First, Second);
CombineInfo &CI = *First;		CombineInfo &CI = *First;
CombineInfo &Paired = *Second;		CombineInfo &Paired = *Second;

SmallVector<MachineInstr *, 8> InstsToMove;		CombineInfo *Where = checkAndPrepareMerge(CI, Paired);
if (!checkAndPrepareMerge(CI, Paired, InstsToMove)) {		if (!Where) {
++I;		++I;
continue;		continue;
}		}

Modified = true;		Modified = true;

LLVM_DEBUG(dbgs() << "Merging: " << *CI.I << " with: " << *Paired.I);		LLVM_DEBUG(dbgs() << "Merging: " << *CI.I << " with: " << *Paired.I);

MachineBasicBlock::iterator NewMI;		MachineBasicBlock::iterator NewMI;
switch (CI.InstClass) {		switch (CI.InstClass) {
default:		default:
llvm_unreachable("unknown InstClass");		llvm_unreachable("unknown InstClass");
break;		break;
case DS_READ:		case DS_READ:
NewMI = mergeRead2Pair(CI, Paired, InstsToMove);		NewMI = mergeRead2Pair(CI, Paired, Where->I);
break;		break;
case DS_WRITE:		case DS_WRITE:
NewMI = mergeWrite2Pair(CI, Paired, InstsToMove);		NewMI = mergeWrite2Pair(CI, Paired, Where->I);
break;		break;
case S_BUFFER_LOAD_IMM:		case S_BUFFER_LOAD_IMM:
NewMI = mergeSBufferLoadImmPair(CI, Paired, InstsToMove);		NewMI = mergeSBufferLoadImmPair(CI, Paired, Where->I);
OptimizeListAgain |= CI.Width + Paired.Width < 8;		OptimizeListAgain |= CI.Width + Paired.Width < 8;
break;		break;
case BUFFER_LOAD:		case BUFFER_LOAD:
NewMI = mergeBufferLoadPair(CI, Paired, InstsToMove);		NewMI = mergeBufferLoadPair(CI, Paired, Where->I);
OptimizeListAgain |= CI.Width + Paired.Width < 4;		OptimizeListAgain |= CI.Width + Paired.Width < 4;
break;		break;
case BUFFER_STORE:		case BUFFER_STORE:
NewMI = mergeBufferStorePair(CI, Paired, InstsToMove);		NewMI = mergeBufferStorePair(CI, Paired, Where->I);
OptimizeListAgain |= CI.Width + Paired.Width < 4;		OptimizeListAgain |= CI.Width + Paired.Width < 4;
break;		break;
case MIMG:		case MIMG:
NewMI = mergeImagePair(CI, Paired, InstsToMove);		NewMI = mergeImagePair(CI, Paired, Where->I);
OptimizeListAgain |= CI.Width + Paired.Width < 4;		OptimizeListAgain |= CI.Width + Paired.Width < 4;
break;		break;
case TBUFFER_LOAD:		case TBUFFER_LOAD:
NewMI = mergeTBufferLoadPair(CI, Paired, InstsToMove);		NewMI = mergeTBufferLoadPair(CI, Paired, Where->I);
OptimizeListAgain |= CI.Width + Paired.Width < 4;		OptimizeListAgain |= CI.Width + Paired.Width < 4;
break;		break;
case TBUFFER_STORE:		case TBUFFER_STORE:
NewMI = mergeTBufferStorePair(CI, Paired, InstsToMove);		NewMI = mergeTBufferStorePair(CI, Paired, Where->I);
OptimizeListAgain |= CI.Width + Paired.Width < 4;		OptimizeListAgain |= CI.Width + Paired.Width < 4;
break;		break;
}		}
CI.setMI(NewMI, *this);		CI.setMI(NewMI, *this);
CI.Order = Paired.Order;		CI.Order = Where->Order;
if (I == Second)		if (I == Second)
I = Next;		I = Next;

MergeList.erase(Second);		MergeList.erase(Second);
}		}

return Modified;		return Modified;
}		}

llvm/test/CodeGen/AMDGPU/ds-combine-with-dependence.ll

Show First 20 Lines • Show All 60 Lines • ▼ Show 20 Lines	define amdgpu_kernel void @ds_combine_WAR(float addrspace(1)* %out, float addrspace(3)* %inptr) {
%v1 = load float, float addrspace(3)* %vaddr1, align 4		%v1 = load float, float addrspace(3)* %vaddr1, align 4

%sum = fadd float %v0, %v1		%sum = fadd float %v0, %v1
store float %sum, float addrspace(1)* %out, align 4		store float %sum, float addrspace(1)* %out, align 4
ret void		ret void
}		}


; The second load depends on the store. We can combine the two loads, and the combined load is		; The second load depends on the store. We could combine the two loads, putting
; at the original place of the second load.		; the combined load at the original place of the second load, but we prefer to
		; leave the first load near the start of the function to hide its latency.

; GCN-LABEL: {{^}}ds_combine_RAW		; GCN-LABEL: {{^}}ds_combine_RAW

; GCN: ds_write2_b32 v{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}} offset0:26 offset1:27		; GCN: ds_write2_b32 v{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}} offset0:26 offset1:27
; GCN-NEXT: ds_read2_b32 v{{\[[0-9]+:[0-9]+\]}}, v{{[0-9]+}} offset0:8 offset1:26		; GCN-NEXT: ds_read_b32 v{{[0-9]+}}, v{{[0-9]+}} offset:32
		; GCN-NEXT: ds_read_b32 v{{[0-9]+}}, v{{[0-9]+}} offset:104
define amdgpu_kernel void @ds_combine_RAW(float addrspace(1)* %out, float addrspace(3)* %inptr) {		define amdgpu_kernel void @ds_combine_RAW(float addrspace(1)* %out, float addrspace(3)* %inptr) {

%base = bitcast float addrspace(3)* %inptr to i8 addrspace(3)*		%base = bitcast float addrspace(3)* %inptr to i8 addrspace(3)*
%addr0 = getelementptr i8, i8 addrspace(3)* %base, i32 24		%addr0 = getelementptr i8, i8 addrspace(3)* %base, i32 24
%tmp0 = bitcast i8 addrspace(3)* %addr0 to float addrspace(3)*		%tmp0 = bitcast i8 addrspace(3)* %addr0 to float addrspace(3)*
%vaddr0 = bitcast float addrspace(3)* %tmp0 to <3 x float> addrspace(3)*		%vaddr0 = bitcast float addrspace(3)* %tmp0 to <3 x float> addrspace(3)*
%load0 = load <3 x float>, <3 x float> addrspace(3)* %vaddr0, align 4		%load0 = load <3 x float>, <3 x float> addrspace(3)* %vaddr0, align 4
%v0 = extractelement <3 x float> %load0, i32 2		%v0 = extractelement <3 x float> %load0, i32 2

llvm/test/CodeGen/AMDGPU/ds_read2.ll

	Show First 20 Lines • Show All 1,238 Lines • ▼ Show 20 Lines
	}			}

	define amdgpu_kernel void @ds_read_diff_base_interleaving(			define amdgpu_kernel void @ds_read_diff_base_interleaving(
	; CI-LABEL: ds_read_diff_base_interleaving:			; CI-LABEL: ds_read_diff_base_interleaving:
	; CI: ; %bb.0: ; %bb			; CI: ; %bb.0: ; %bb
	; CI-NEXT: s_load_dwordx4 s[4:7], s[0:1], 0x2			; CI-NEXT: s_load_dwordx4 s[4:7], s[0:1], 0x2
	; CI-NEXT: s_load_dwordx2 s[0:1], s[0:1], 0x0			; CI-NEXT: s_load_dwordx2 s[0:1], s[0:1], 0x0
	; CI-NEXT: v_lshlrev_b32_e32 v1, 4, v1			; CI-NEXT: v_lshlrev_b32_e32 v1, 4, v1
	; CI-NEXT: v_lshlrev_b32_e32 v4, 2, v0			; CI-NEXT: v_lshlrev_b32_e32 v0, 2, v0
	; CI-NEXT: s_mov_b32 m0, -1			; CI-NEXT: s_mov_b32 m0, -1
	; CI-NEXT: s_waitcnt lgkmcnt(0)			; CI-NEXT: s_waitcnt lgkmcnt(0)
	; CI-NEXT: v_add_i32_e32 v2, vcc, s4, v1			; CI-NEXT: v_add_i32_e32 v2, vcc, s4, v1
	; CI-NEXT: v_add_i32_e32 v3, vcc, s5, v4			; CI-NEXT: v_add_i32_e32 v3, vcc, s5, v0
	; CI-NEXT: v_add_i32_e32 v5, vcc, s6, v1			; CI-NEXT: v_add_i32_e32 v4, vcc, s6, v1
				; CI-NEXT: v_add_i32_e32 v6, vcc, s7, v0
	; CI-NEXT: ds_read2_b32 v[0:1], v2 offset1:1			; CI-NEXT: ds_read2_b32 v[0:1], v2 offset1:1
	; CI-NEXT: ds_read2_b32 v[2:3], v3 offset1:4			; CI-NEXT: ds_read2_b32 v[2:3], v3 offset1:4
	; CI-NEXT: v_add_i32_e32 v6, vcc, s7, v4			; CI-NEXT: ds_read2_b32 v[4:5], v4 offset1:1
	; CI-NEXT: ds_read2_b32 v[4:5], v5 offset1:1
	; CI-NEXT: ds_read2_b32 v[6:7], v6 offset1:4			; CI-NEXT: ds_read2_b32 v[6:7], v6 offset1:4
	; CI-NEXT: s_mov_b32 s3, 0xf000			; CI-NEXT: s_mov_b32 s3, 0xf000
				; CI-NEXT: s_mov_b32 s2, -1
	; CI-NEXT: s_waitcnt lgkmcnt(2)			; CI-NEXT: s_waitcnt lgkmcnt(2)
	; CI-NEXT: v_mul_f32_e32 v0, v0, v2			; CI-NEXT: v_mul_f32_e32 v0, v0, v2
	; CI-NEXT: v_add_f32_e32 v0, 2.0, v0			; CI-NEXT: v_add_f32_e32 v0, 2.0, v0
	; CI-NEXT: v_mul_f32_e32 v1, v1, v3
	; CI-NEXT: s_waitcnt lgkmcnt(0)			; CI-NEXT: s_waitcnt lgkmcnt(0)
	; CI-NEXT: v_mul_f32_e32 v2, v4, v6			; CI-NEXT: v_mul_f32_e32 v2, v4, v6
	; CI-NEXT: v_sub_f32_e32 v0, v0, v2			; CI-NEXT: v_sub_f32_e32 v0, v0, v2
				; CI-NEXT: v_mul_f32_e32 v1, v1, v3
	; CI-NEXT: v_sub_f32_e32 v0, v0, v1			; CI-NEXT: v_sub_f32_e32 v0, v0, v1
	; CI-NEXT: v_mul_f32_e32 v1, v5, v7			; CI-NEXT: v_mul_f32_e32 v1, v5, v7
	; CI-NEXT: s_mov_b32 s2, -1
	; CI-NEXT: v_sub_f32_e32 v0, v0, v1			; CI-NEXT: v_sub_f32_e32 v0, v0, v1
	; CI-NEXT: buffer_store_dword v0, off, s[0:3], 0 offset:40			; CI-NEXT: buffer_store_dword v0, off, s[0:3], 0 offset:40
	; CI-NEXT: s_endpgm			; CI-NEXT: s_endpgm
	;			;
	; GFX9-LABEL: ds_read_diff_base_interleaving:			; GFX9-LABEL: ds_read_diff_base_interleaving:
	; GFX9: ; %bb.0: ; %bb			; GFX9: ; %bb.0: ; %bb
	; GFX9-NEXT: s_load_dwordx4 s[4:7], s[0:1], 0x8			; GFX9-NEXT: s_load_dwordx4 s[4:7], s[0:1], 0x8
	; GFX9-NEXT: v_lshlrev_b32_e32 v1, 4, v1			; GFX9-NEXT: v_lshlrev_b32_e32 v1, 4, v1

llvm/test/CodeGen/AMDGPU/lower-lds-struct-aa.ll

	; RUN: llc -march=amdgcn -mcpu=gfx900 -O3 < %s | FileCheck -check-prefix=GCN %s			; RUN: llc -march=amdgcn -mcpu=gfx900 -O3 < %s | FileCheck -check-prefix=GCN %s
	; RUN: opt -S -mtriple=amdgcn-- -amdgpu-lower-module-lds < %s | FileCheck %s			; RUN: opt -S -mtriple=amdgcn-- -amdgpu-lower-module-lds < %s | FileCheck %s
	; RUN: opt -S -mtriple=amdgcn-- -passes=amdgpu-lower-module-lds < %s | FileCheck %s			; RUN: opt -S -mtriple=amdgcn-- -passes=amdgpu-lower-module-lds < %s | FileCheck %s

	@a = internal unnamed_addr addrspace(3) global [64 x i32] undef, align 4			@a = internal unnamed_addr addrspace(3) global [64 x i32] undef, align 4
	@b = internal unnamed_addr addrspace(3) global [64 x i32] undef, align 4			@b = internal unnamed_addr addrspace(3) global [64 x i32] undef, align 4
	@c = internal unnamed_addr addrspace(3) global [64 x i32] undef, align 4			@c = internal unnamed_addr addrspace(3) global [64 x i32] undef, align 4

				; FIXME: Should combine the DS instructions into ds_write2 and ds_read2. This
				; does not happen because when SILoadStoreOptimizer is run, the reads and writes
				; are not adjacent. They are only moved later by MachineScheduler.

	; GCN-LABEL: {{^}}no_clobber_ds_load_stores_x2:			; GCN-LABEL: {{^}}no_clobber_ds_load_stores_x2:
	; GCN: ds_write2st64_b32			; GCN: ds_write_b32
	; GCN: ds_read2st64_b32			; GCN: ds_write_b32
				; GCN: ds_read_b32
				; GCN: ds_read_b32

	; CHECK-LABEL: @no_clobber_ds_load_stores_x2			; CHECK-LABEL: @no_clobber_ds_load_stores_x2
	; CHECK: store i32 1, i32 addrspace(3)* %0, align 16, !alias.scope !0, !noalias !3			; CHECK: store i32 1, i32 addrspace(3)* %0, align 16, !alias.scope !0, !noalias !3
	; CHECK: %val.a = load i32, i32 addrspace(3)* %gep.a, align 4, !alias.scope !0, !noalias !3			; CHECK: %val.a = load i32, i32 addrspace(3)* %gep.a, align 4, !alias.scope !0, !noalias !3
	; CHECK: store i32 2, i32 addrspace(3)* %1, align 16, !alias.scope !3, !noalias !0			; CHECK: store i32 2, i32 addrspace(3)* %1, align 16, !alias.scope !3, !noalias !0
	; CHECK: %val.b = load i32, i32 addrspace(3)* %gep.b, align 4, !alias.scope !3, !noalias !0			; CHECK: %val.b = load i32, i32 addrspace(3)* %gep.b, align 4, !alias.scope !3, !noalias !0

	define amdgpu_kernel void @no_clobber_ds_load_stores_x2(i32 addrspace(1)* %arg, i32 %i) {			define amdgpu_kernel void @no_clobber_ds_load_stores_x2(i32 addrspace(1)* %arg, i32 %i) {
	bb:			bb:
	store i32 1, i32 addrspace(3)* getelementptr inbounds ([64 x i32], [64 x i32] addrspace(3)* @a, i32 0, i32 0), align 4			store i32 1, i32 addrspace(3)* getelementptr inbounds ([64 x i32], [64 x i32] addrspace(3)* @a, i32 0, i32 0), align 4
	%gep.a = getelementptr inbounds [64 x i32], [64 x i32] addrspace(3)* @a, i32 0, i32 %i			%gep.a = getelementptr inbounds [64 x i32], [64 x i32] addrspace(3)* @a, i32 0, i32 %i
	%val.a = load i32, i32 addrspace(3)* %gep.a, align 4			%val.a = load i32, i32 addrspace(3)* %gep.a, align 4
	store i32 2, i32 addrspace(3)* getelementptr inbounds ([64 x i32], [64 x i32] addrspace(3)* @b, i32 0, i32 0), align 4			store i32 2, i32 addrspace(3)* getelementptr inbounds ([64 x i32], [64 x i32] addrspace(3)* @b, i32 0, i32 0), align 4
	%gep.b = getelementptr inbounds [64 x i32], [64 x i32] addrspace(3)* @b, i32 0, i32 %i			%gep.b = getelementptr inbounds [64 x i32], [64 x i32] addrspace(3)* @b, i32 0, i32 %i
	%val.b = load i32, i32 addrspace(3)* %gep.b, align 4			%val.b = load i32, i32 addrspace(3)* %gep.b, align 4
	%val = add i32 %val.a, %val.b			%val = add i32 %val.a, %val.b
	store i32 %val, i32 addrspace(1)* %arg, align 4			store i32 %val, i32 addrspace(1)* %arg, align 4
	ret void			ret void
	}			}

	; GCN-LABEL: {{^}}no_clobber_ds_load_stores_x3:			; GCN-LABEL: {{^}}no_clobber_ds_load_stores_x3:
	; GCN-DAG: ds_write2st64_b32
	; GCN-DAG: ds_write_b32			; GCN-DAG: ds_write_b32
	; GCN-DAG: ds_read2st64_b32			; GCN-DAG: ds_write_b32
				; GCN-DAG: ds_write_b32
				; GCN-DAG: ds_read_b32
				; GCN-DAG: ds_read_b32
	; GCN-DAG: ds_read_b32			; GCN-DAG: ds_read_b32

	; CHECK-LABEL: @no_clobber_ds_load_stores_x3			; CHECK-LABEL: @no_clobber_ds_load_stores_x3
	; CHECK: store i32 1, i32 addrspace(3)* %0, align 16, !alias.scope !5, !noalias !8			; CHECK: store i32 1, i32 addrspace(3)* %0, align 16, !alias.scope !5, !noalias !8
	; CHECK: %val.a = load i32, i32 addrspace(3)* %gep.a, align 4, !alias.scope !5, !noalias !8			; CHECK: %val.a = load i32, i32 addrspace(3)* %gep.a, align 4, !alias.scope !5, !noalias !8
	; CHECK: store i32 2, i32 addrspace(3)* %1, align 16, !alias.scope !11, !noalias !12			; CHECK: store i32 2, i32 addrspace(3)* %1, align 16, !alias.scope !11, !noalias !12
	; CHECK: %val.b = load i32, i32 addrspace(3)* %gep.b, align 4, !alias.scope !11, !noalias !12			; CHECK: %val.b = load i32, i32 addrspace(3)* %gep.b, align 4, !alias.scope !11, !noalias !12
	; CHECK: store i32 3, i32 addrspace(3)* %2, align 16, !alias.scope !13, !noalias !14			; CHECK: store i32 3, i32 addrspace(3)* %2, align 16, !alias.scope !13, !noalias !14

llvm/test/CodeGen/AMDGPU/merge-load-store-physreg.mir

	# RUN: llc -march=amdgcn -verify-machineinstrs -run-pass si-load-store-opt -o - %s | FileCheck %s			# RUN: llc -march=amdgcn -verify-machineinstrs -run-pass si-load-store-opt -o - %s | FileCheck %s

	# Check that SILoadStoreOptimizer honors physregs defs/uses between moved			# Check that SILoadStoreOptimizer honors physregs defs/uses between moved
	# instructions.			# instructions.
	#			#
	# The following IR snippet would usually be optimized by the peephole optimizer.			# The following IR snippet would usually be optimized by the peephole optimizer.
	# However, an equivalent situation can occur with buffer instructions as well.			# However, an equivalent situation can occur with buffer instructions as well.

	# CHECK-LABEL: name: scc_def_and_use_no_dependency			# CHECK-LABEL: name: scc_def_and_use_no_dependency
				# CHECK: DS_READ2_B32
	# CHECK: S_ADD_U32			# CHECK: S_ADD_U32
	# CHECK: S_ADDC_U32			# CHECK: S_ADDC_U32
	# CHECK: DS_READ2_B32
	---			---
	name: scc_def_and_use_no_dependency			name: scc_def_and_use_no_dependency
	machineFunctionInfo:			machineFunctionInfo:
	isEntryFunction: true			isEntryFunction: true
	body: |			body: |
	bb.0:			bb.0:
	liveins: $vgpr0, $sgpr0			liveins: $vgpr0, $sgpr0


llvm/test/CodeGen/AMDGPU/merge-out-of-order-ldst.mir

	# RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -verify-machineinstrs -run-pass si-load-store-opt %s -o - | FileCheck -check-prefix=GCN %s			# RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -verify-machineinstrs -run-pass si-load-store-opt %s -o - | FileCheck -check-prefix=GCN %s

	# GCN-LABEL: name: out_of_order_merge			# GCN-LABEL: name: out_of_order_merge
	# GCN: DS_READ2_B64_gfx9			# GCN: DS_READ2_B64_gfx9
	# GCN: DS_WRITE_B64_gfx9
	# GCN: DS_READ2_B64_gfx9			# GCN: DS_READ2_B64_gfx9
	# GCN: DS_WRITE_B64_gfx9			# GCN: DS_WRITE2_B64_gfx9
	# GCN: DS_WRITE_B64_gfx9			# GCN: DS_WRITE_B64_gfx9
	---			---
	name: out_of_order_merge			name: out_of_order_merge
	body: |			body: |
	bb.0:			bb.0:
	%4:vgpr_32 = V_MOV_B32_e32 0, implicit $exec			%4:vgpr_32 = V_MOV_B32_e32 0, implicit $exec
	%5:vreg_64 = DS_READ_B64_gfx9 %4, 776, 0, implicit $exec :: (load (s64) from `double addrspace(3)* undef`, addrspace 3)			%5:vreg_64 = DS_READ_B64_gfx9 %4, 776, 0, implicit $exec :: (load (s64) from `double addrspace(3)* undef`, addrspace 3)
	%6:vreg_64 = DS_READ_B64_gfx9 %4, 784, 0, implicit $exec :: (load (s64) from `double addrspace(3)* undef` + 8, addrspace 3)			%6:vreg_64 = DS_READ_B64_gfx9 %4, 784, 0, implicit $exec :: (load (s64) from `double addrspace(3)* undef` + 8, addrspace 3)
	%17:vreg_64 = DS_READ_B64_gfx9 %4, 840, 0, implicit $exec :: (load (s64) from `double addrspace(3)* undef`, addrspace 3)			%17:vreg_64 = DS_READ_B64_gfx9 %4, 840, 0, implicit $exec :: (load (s64) from `double addrspace(3)* undef`, addrspace 3)
	DS_WRITE_B64_gfx9 %4, %17, 8, 0, implicit $exec :: (store (s64) into `double addrspace(3)* undef` + 8, addrspace 3)			DS_WRITE_B64_gfx9 %4, %17, 8, 0, implicit $exec :: (store (s64) into `double addrspace(3)* undef` + 8, addrspace 3)
	DS_WRITE_B64_gfx9 %4, %6, 0, 0, implicit $exec :: (store (s64) into `double addrspace(3)* undef`, align 16, addrspace 3)			DS_WRITE_B64_gfx9 %4, %6, 0, 0, implicit $exec :: (store (s64) into `double addrspace(3)* undef`, align 16, addrspace 3)
	%24:vreg_64 = DS_READ_B64_gfx9 %4, 928, 0, implicit $exec :: (load (s64) from `double addrspace(3)* undef` + 8, addrspace 3)			%24:vreg_64 = DS_READ_B64_gfx9 %4, 928, 0, implicit $exec :: (load (s64) from `double addrspace(3)* undef` + 8, addrspace 3)
	DS_WRITE_B64_gfx9 undef %29:vgpr_32, %5, 0, 0, implicit $exec :: (store (s64) into `double addrspace(3)* undef`, addrspace 3)			DS_WRITE_B64_gfx9 undef %29:vgpr_32, %5, 0, 0, implicit $exec :: (store (s64) into `double addrspace(3)* undef`, addrspace 3)
	S_ENDPGM 0			S_ENDPGM 0

	...			...

llvm/test/CodeGen/AMDGPU/merge-tbuffer.mir

bb.0.entry:
%5:sgpr_128 = REG_SEQUENCE %0:sgpr_32, %subreg.sub0, %1:sgpr_32, %subreg.sub1, %2:sgpr_32, %subreg.sub2, %3:sgpr_32, %subreg.sub3		%5:sgpr_128 = REG_SEQUENCE %0:sgpr_32, %subreg.sub0, %1:sgpr_32, %subreg.sub1, %2:sgpr_32, %subreg.sub2, %3:sgpr_32, %subreg.sub3
%7:vgpr_32 = TBUFFER_LOAD_FORMAT_X_OFFSET %5:sgpr_128, 0, 4, 116, 0, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 4)		%7:vgpr_32 = TBUFFER_LOAD_FORMAT_X_OFFSET %5:sgpr_128, 0, 4, 116, 0, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 4)
%8:vgpr_32 = TBUFFER_LOAD_FORMAT_X_OFFSET %5:sgpr_128, 0, 8, 116, 0, 0, 1, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 4)		%8:vgpr_32 = TBUFFER_LOAD_FORMAT_X_OFFSET %5:sgpr_128, 0, 8, 116, 0, 0, 1, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 4)
...		...
---		---


# GFX9-LABEL: name: gfx9_tbuffer_load_merge_across_swizzle		# GFX9-LABEL: name: gfx9_tbuffer_load_merge_across_swizzle
# GFX9: %{{[0-9]+}}:vgpr_32 = TBUFFER_LOAD_FORMAT_X_OFFSET %4, 0, 12, 116, 0, 0, 1, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 4)		# GFX9-DAG: %{{[0-9]+}}:vgpr_32 = TBUFFER_LOAD_FORMAT_X_OFFSET %4, 0, 12, 116, 0, 0, 1, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 4)
# GFX9: %{{[0-9]+}}:vreg_64 = TBUFFER_LOAD_FORMAT_XY_OFFSET %4, 0, 4, 123, 0, 0, 0, implicit $exec :: (dereferenceable load (s64), align 1, addrspace 4)		# GFX9-DAG: %{{[0-9]+}}:vreg_64 = TBUFFER_LOAD_FORMAT_XY_OFFSET %4, 0, 4, 123, 0, 0, 0, implicit $exec :: (dereferenceable load (s64), align 1, addrspace 4)
name: gfx9_tbuffer_load_merge_across_swizzle		name: gfx9_tbuffer_load_merge_across_swizzle
body: |		body: |
bb.0.entry:		bb.0.entry:
%0:sgpr_32 = COPY $sgpr0		%0:sgpr_32 = COPY $sgpr0
%1:sgpr_32 = COPY $sgpr1		%1:sgpr_32 = COPY $sgpr1
%2:sgpr_32 = COPY $sgpr2		%2:sgpr_32 = COPY $sgpr2
%3:sgpr_32 = COPY $sgpr3		%3:sgpr_32 = COPY $sgpr3
%4:sgpr_128 = REG_SEQUENCE %0:sgpr_32, %subreg.sub0, %1:sgpr_32, %subreg.sub1, %2:sgpr_32, %subreg.sub2, %3:sgpr_32, %subreg.sub3		%4:sgpr_128 = REG_SEQUENCE %0:sgpr_32, %subreg.sub0, %1:sgpr_32, %subreg.sub1, %2:sgpr_32, %subreg.sub2, %3:sgpr_32, %subreg.sub3
bb.0.entry:
%5:sgpr_128 = REG_SEQUENCE %0:sgpr_32, %subreg.sub0, %1:sgpr_32, %subreg.sub1, %2:sgpr_32, %subreg.sub2, %3:sgpr_32, %subreg.sub3		%5:sgpr_128 = REG_SEQUENCE %0:sgpr_32, %subreg.sub0, %1:sgpr_32, %subreg.sub1, %2:sgpr_32, %subreg.sub2, %3:sgpr_32, %subreg.sub3
%7:vgpr_32 = TBUFFER_LOAD_FORMAT_X_OFFSET %5:sgpr_128, 0, 4, 22, 0, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 4)		%7:vgpr_32 = TBUFFER_LOAD_FORMAT_X_OFFSET %5:sgpr_128, 0, 4, 22, 0, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 4)
%8:vgpr_32 = TBUFFER_LOAD_FORMAT_X_OFFSET %5:sgpr_128, 0, 8, 22, 0, 0, 1, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 4)		%8:vgpr_32 = TBUFFER_LOAD_FORMAT_X_OFFSET %5:sgpr_128, 0, 8, 22, 0, 0, 1, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 4)
...		...
---		---


# GFX10-LABEL: name: gfx10_tbuffer_load_merge_across_swizzle		# GFX10-LABEL: name: gfx10_tbuffer_load_merge_across_swizzle
# GFX10: %{{[0-9]+}}:vgpr_32 = TBUFFER_LOAD_FORMAT_X_OFFSET %4, 0, 12, 22, 0, 0, 1, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 4)		# GFX10-DAG: %{{[0-9]+}}:vgpr_32 = TBUFFER_LOAD_FORMAT_X_OFFSET %4, 0, 12, 22, 0, 0, 1, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 4)
# GFX10: %{{[0-9]+}}:vreg_64 = TBUFFER_LOAD_FORMAT_XY_OFFSET %4, 0, 4, 64, 0, 0, 0, implicit $exec :: (dereferenceable load (s64), align 1, addrspace 4)		# GFX10-DAG: %{{[0-9]+}}:vreg_64 = TBUFFER_LOAD_FORMAT_XY_OFFSET %4, 0, 4, 64, 0, 0, 0, implicit $exec :: (dereferenceable load (s64), align 1, addrspace 4)
name: gfx10_tbuffer_load_merge_across_swizzle		name: gfx10_tbuffer_load_merge_across_swizzle
body: |		body: |
bb.0.entry:		bb.0.entry:
%0:sgpr_32 = COPY $sgpr0		%0:sgpr_32 = COPY $sgpr0
%1:sgpr_32 = COPY $sgpr1		%1:sgpr_32 = COPY $sgpr1
%2:sgpr_32 = COPY $sgpr2		%2:sgpr_32 = COPY $sgpr2
%3:sgpr_32 = COPY $sgpr3		%3:sgpr_32 = COPY $sgpr3
%4:sgpr_128 = REG_SEQUENCE %0:sgpr_32, %subreg.sub0, %1:sgpr_32, %subreg.sub1, %2:sgpr_32, %subreg.sub2, %3:sgpr_32, %subreg.sub3		%4:sgpr_128 = REG_SEQUENCE %0:sgpr_32, %subreg.sub0, %1:sgpr_32, %subreg.sub1, %2:sgpr_32, %subreg.sub2, %3:sgpr_32, %subreg.sub3
%5:vgpr_32 = TBUFFER_LOAD_FORMAT_X_OFFSET %4:sgpr_128, 0, 4, 22, 0, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 4)		%5:vgpr_32 = TBUFFER_LOAD_FORMAT_X_OFFSET %4:sgpr_128, 0, 4, 22, 0, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 4)
%6:vgpr_32 = TBUFFER_LOAD_FORMAT_X_OFFSET %4:sgpr_128, 0, 12, 22, 0, 0, 1, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 4)		%6:vgpr_32 = TBUFFER_LOAD_FORMAT_X_OFFSET %4:sgpr_128, 0, 12, 22, 0, 0, 1, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 4)
%7:vgpr_32 = TBUFFER_LOAD_FORMAT_X_OFFSET %4:sgpr_128, 0, 8, 22, 0, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 4)		%7:vgpr_32 = TBUFFER_LOAD_FORMAT_X_OFFSET %4:sgpr_128, 0, 8, 22, 0, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 4)
...		...
---		---

llvm/test/CodeGen/AMDGPU/si-triv-disjoint-mem-access.ll

	; RUN: llc -amdgpu-scalarize-global-loads=false -march=amdgcn -mcpu=bonaire -enable-amdgpu-aa=0 -verify-machineinstrs -enable-misched -enable-aa-sched-mi < %s | FileCheck -enable-var-scope -check-prefixes=GCN,CI %s			; RUN: llc -amdgpu-scalarize-global-loads=false -march=amdgcn -mcpu=bonaire -enable-amdgpu-aa=0 -verify-machineinstrs -enable-misched -enable-aa-sched-mi < %s | FileCheck -enable-var-scope -check-prefixes=GCN,CI %s
	; RUN: llc -amdgpu-scalarize-global-loads=false -march=amdgcn -mcpu=gfx900 -enable-amdgpu-aa=0 -verify-machineinstrs -enable-misched -enable-aa-sched-mi < %s | FileCheck -enable-var-scope -check-prefixes=GCN,GFX9 %s			; RUN: llc -amdgpu-scalarize-global-loads=false -march=amdgcn -mcpu=gfx900 -enable-amdgpu-aa=0 -verify-machineinstrs -enable-misched -enable-aa-sched-mi < %s | FileCheck -enable-var-scope -check-prefixes=GCN,GFX9 %s

	@stored_lds_ptr = addrspace(3) global i32 addrspace(3)* undef, align 4			@stored_lds_ptr = addrspace(3) global i32 addrspace(3)* undef, align 4
	@stored_constant_ptr = addrspace(3) global i32 addrspace(4)* undef, align 8			@stored_constant_ptr = addrspace(3) global i32 addrspace(4)* undef, align 8
	@stored_global_ptr = addrspace(3) global i32 addrspace(1)* undef, align 8			@stored_global_ptr = addrspace(3) global i32 addrspace(1)* undef, align 8

	; GCN-LABEL: {{^}}reorder_local_load_global_store_local_load:			; GCN-LABEL: {{^}}reorder_local_load_global_store_local_load:
	; CI: ds_read2_b32 {{v\[[0-9]+:[0-9]+\]}}, {{v[0-9]+}} offset0:1 offset1:3			; CI: ds_read2_b32 {{v\[[0-9]+:[0-9]+\]}}, {{v[0-9]+}} offset0:1 offset1:3
	; CI: buffer_store_dword			; CI: buffer_store_dword

	; GFX9: global_store_dword
	; GFX9: ds_read2_b32 {{v\[[0-9]+:[0-9]+\]}}, {{v[0-9]+}} offset0:1 offset1:3			; GFX9: ds_read2_b32 {{v\[[0-9]+:[0-9]+\]}}, {{v[0-9]+}} offset0:1 offset1:3
	; GFX9: global_store_dword			; GFX9: global_store_dword
				; GFX9: global_store_dword
	define amdgpu_kernel void @reorder_local_load_global_store_local_load(i32 addrspace(1)* %out, i32 addrspace(1)* %gptr) #0 {			define amdgpu_kernel void @reorder_local_load_global_store_local_load(i32 addrspace(1)* %out, i32 addrspace(1)* %gptr) #0 {
	%ptr0 = load i32 addrspace(3)*, i32 addrspace(3)* addrspace(3)* @stored_lds_ptr, align 4			%ptr0 = load i32 addrspace(3)*, i32 addrspace(3)* addrspace(3)* @stored_lds_ptr, align 4

	%ptr1 = getelementptr inbounds i32, i32 addrspace(3)* %ptr0, i32 1			%ptr1 = getelementptr inbounds i32, i32 addrspace(3)* %ptr0, i32 1
	%ptr2 = getelementptr inbounds i32, i32 addrspace(3)* %ptr0, i32 3			%ptr2 = getelementptr inbounds i32, i32 addrspace(3)* %ptr0, i32 3

	%tmp1 = load i32, i32 addrspace(3)* %ptr1, align 4			%tmp1 = load i32, i32 addrspace(3)* %ptr1, align 4
	store i32 99, i32 addrspace(1)* %gptr, align 4			store i32 99, i32 addrspace(1)* %gptr, align 4