This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU/SILoadStoreOptimizer: Optimize scanning for mergeable instructions
ClosedPublic

Authored by tstellar on Aug 8 2019, 10:39 AM.

Details

Reviewers

arsenm
pendingchaos
rampitec
nhaehnle
vpykhtin

Commits

rGe6f51713054f: AMDGPU/SILoadStoreOptimizer: Optimize scanning for mergeable instructions
rL373630: AMDGPU/SILoadStoreOptimizer: Optimize scanning for mergeable instructions

Summary

This adds a pre-pass to this optimization that scans through the basic
block and generates lists of mergeable instructions with one list per unique
address.

In the optimization phase, instead of scanning through the basic block for mergeable
instructions, we now iterate over the lists generated by the pre-pass.

The decision to re-optimize a block is now made per list, so if we fail to merge any
instructions with the same address, then we do not attempt to optimize them in
future passes over the block. This will help to reduce the time this pass
spends re-optimizing instructions.

In one pathological test case, this change reduces the time spent in the
SILoadStoreOptimizer from 0.2s to 0.03s.
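
The following is a minimal, self-contained sketch of the structure described above, not the code from this patch. `Access`, `collectByBase`, `mergeOnePair` and `optimizeOnce` are hypothetical names, and the merge rule (offsets exactly 4 apart) is a toy assumption standing in for the real legality checks.

```cpp
#include <cassert>
#include <cstdlib>
#include <iterator>
#include <list>
#include <string>
#include <vector>

// Hypothetical stand-in for a memory instruction: a base address name and a
// constant offset. This is not the real CombineInfo.
struct Access {
  std::string Base;
  int Offset;
};

using MergeLists = std::list<std::list<Access>>;

// Pre-pass: build one list per unique base address, in first-seen order.
MergeLists collectByBase(const std::vector<Access> &Block) {
  MergeLists Lists;
  for (const Access &A : Block) {
    bool Placed = false;
    for (std::list<Access> &L : Lists) {
      if (L.front().Base == A.Base) {
        L.push_back(A);
        Placed = true;
        break;
      }
    }
    if (!Placed)
      Lists.emplace_back(1, A); // new list holding a single entry
  }
  return Lists;
}

// Toy merge rule: find two entries whose offsets are exactly 4 apart and
// drop both from the list. Returns true if a pair was merged.
bool mergeOnePair(std::list<Access> &L) {
  for (auto I = L.begin(); I != L.end(); ++I)
    for (auto J = std::next(I); J != L.end(); ++J)
      if (std::abs(I->Offset - J->Offset) == 4) {
        L.erase(J); // erasing J leaves I valid in a std::list
        L.erase(I);
        return true;
      }
  return false;
}

// One optimization pass over the lists. A list that yields no merge is
// cleared, so later passes skip it. Returns the number of merges made.
int optimizeOnce(MergeLists &Lists) {
  int Merges = 0;
  for (std::list<Access> &L : Lists) {
    if (L.size() < 2)
      continue;
    if (mergeOnePair(L))
      ++Merges;
    else
      L.clear();
  }
  return Merges;
}
```

A caller would run `collectByBase` once per block and then call `optimizeOnce` repeatedly until it returns 0. The re-run decision is per list, so a base address with nothing left to merge is not scanned again.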

This restructuring will also make it possible to implement further improvements in
this pass, because we can now add less expensive checks to the pre-pass and
filter instructions out early, which avoids the need to do the expensive
scanning during the optimization pass. For example, checking for adjacent
offsets is an inexpensive test we can move to the pre-pass.
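
As a rough illustration of such an inexpensive pre-filter, the sketch below keeps only the offsets that have a neighbour exactly one element away. `keepAdjacent` and its `EltSize` parameter are hypothetical; the actual pass has its own per-instruction-class offset rules, which this does not model.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Returns, in sorted order, the offsets that have another offset exactly
// EltSize bytes below or above them. Entries without such a neighbour could
// be dropped before the expensive scan. Hypothetical helper, not LLVM API.
std::vector<int> keepAdjacent(std::vector<int> Offsets, int EltSize) {
  std::sort(Offsets.begin(), Offsets.end());
  std::vector<int> Kept;
  for (int Off : Offsets) {
    bool HasLower =
        std::binary_search(Offsets.begin(), Offsets.end(), Off - EltSize);
    bool HasUpper =
        std::binary_search(Offsets.begin(), Offsets.end(), Off + EltSize);
    if (HasLower || HasUpper)
      Kept.push_back(Off);
  }
  return Kept;
}
```

For example, `keepAdjacent({16, 0, 100, 4, 40}, 4)` keeps only 0 and 4. The check costs one sort plus two binary searches per entry, which is cheap next to scanning the block for each candidate.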

Diff Detail

Repository

rG LLVM Github Monorepo

Build Status

Buildable 36442
Build 36441: arc lint + arc unit

Event Timeline

tstellar created this revision. Aug 8 2019, 10:39 AM

Herald added a project: Restricted Project. Aug 8 2019, 10:39 AM

Herald added subscribers: hiraditya, t-tye, tpr and 5 others.

Harbormaster completed remote builds in B36442: Diff 214187. Aug 8 2019, 10:39 AM

tstellar added parent revisions: D65496: AMDGPU/SILoadStoreOptimizer: Add helper functions for working with CombineInfo, D65097: AMDGPU: Add offsets to MMO when lowering buffer intrinsics, D65901: AMDGPU/SILoadStoreOptimizer: Add const to more functions. Aug 8 2019, 12:36 PM

tstellar added a child revision: D65966: AMDGPU/SILoadStoreOptimizer: Improve merging of out of order offsets. Aug 8 2019, 12:51 PM

arsenm added inline comments. Sep 5 2019, 11:49 AM

llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
181	Typo s/on/no
1589	Why std::list, and a std::list of lists?

tstellar marked an inline comment as done. Sep 13 2019, 5:38 PM

tstellar added inline comments.

llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
1589	The main reason to use lists is so I can remove items without invalidating iterators.

LGTM but I would rather avoid the list usage

This revision is now accepted and ready to land. Sep 19 2019, 5:05 PM

In D65961#1675758, @arsenm wrote:

LGTM but I would rather avoid the list usage

What do you think would be a better alternative? The operations used are iteration, emplace_back, size, and erase.
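
For context on the iterator-invalidation point, here is a small sketch of the pattern in question: erasing a different element from a `std::list` while an outer loop still holds an iterator into it. `removeFirst` and `pairUp` are hypothetical names that loosely mirror the shape of `removeCombinedInst` and the merge loop; plain `int` values stand in for `CombineInfo`.

```cpp
#include <cassert>
#include <list>

// Erases the first element equal to Value. Returns true if one was erased.
bool removeFirst(std::list<int> &L, int Value) {
  for (auto It = L.begin(), E = L.end(); It != E; ++It) {
    if (*It == Value) {
      L.erase(It);
      return true;
    }
  }
  return false;
}

// For each element, erases its "partner" (value + 1) if present and keeps
// iterating with the same iterator. With std::list, erase only invalidates
// iterators to the erased element, so I stays valid here.
int pairUp(std::list<int> &L) {
  int Pairs = 0;
  for (auto I = L.begin(); I != L.end(); ++I)
    if (removeFirst(L, *I + 1))
      ++Pairs;
  return Pairs;
}
```

With a `std::vector`, the same loop would need extra care, because erasing an element invalidates iterators at and after the erased position.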

Closed by commit rL373630: AMDGPU/SILoadStoreOptimizer: Optimize scanning for mergeable instructions (authored by tstellar). Oct 3 2019, 10:10 AM

This revision was automatically updated to reflect the committed changes.

Thank you for doing this, it seems quite useful. As a follow-up to this change, do you think it makes sense to refactor CombineInfo a bit? We have a list of mergeable instructions, but the CombineInfo structure also has fields for a second instruction, which are only for temporary use, which is a bit odd.

nhaehnle added a child revision: D68690: AMDGPU/SILoadStoreOptimizer: fix a likely bug introduced recently. Oct 9 2019, 3:57 AM

Hi @tstellar, I'm looking into a case where this patch slowed down a shader by 10%. Before I go too far, was this patch supposed to change the behaviour at all, or was it supposed to be purely a compile time improvement?

In the case I'm looking at it seems to do the same amount of load merging as before, but the merged loads are inserted at different places in the basic block.

In D65961#1763143, @foad wrote:

Hi @tstellar, I'm looking into a case where this patch slowed down a shader by 10%. Before I go too far, was this patch supposed to change the behaviour at all, or was it supposed to be purely a compile time improvement?

The intention was to not change the behavior at all.

In the case I'm looking at it seems to do the same amount of load merging as before, but the merged loads are inserted at different places in the basic block.

Do you have a MIR or .ll dump of the shader I could look at? Also, does https://reviews.llvm.org/D65966 help?

In D65961#1765026, @tstellar wrote:

Do you have a MIR or .ll dump of the shader I could look at? Also, does https://reviews.llvm.org/D65966 help?

P8173 is a MIR test case. See the RUN line for how to run it. I see significant differences in the placing of the merged BUFFER_LOAD instructions with/without D65961 (or before/after it was committed).

I tried applying D65966 on top of rGe6f51713054f but it made no difference to the output.

In D65961#1766760, @foad wrote:

I tried applying D65966 on top of rGe6f51713054f but it made no difference to the output.

You need to apply it on top of 3a8d80944b7766449e2c8784a8fb30d19a2ba16c or newer for it to have an impact. When I do that, D65966 does help improve the merging.

Revision Contents

Path: llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
Size: 263 lines

Diff 214187

llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp

[155 lines not shown]	bool hasSameBaseAddress(const MachineInstr &MI) {
if (AddrReg[i]->getReg() != AddrRegNext.getReg() ||		if (AddrReg[i]->getReg() != AddrRegNext.getReg() ||
AddrReg[i]->getSubReg() != AddrRegNext.getSubReg()) {		AddrReg[i]->getSubReg() != AddrRegNext.getSubReg()) {
return false;		return false;
}		}
}		}
return true;		return true;
}		}

		bool hasMergeableAddress(const MachineRegisterInfo &MRI) {
		for (unsigned i = 0; i < NumAddresses; ++i) {
		const MachineOperand *AddrOp = AddrReg[i];
		// Immediates are always OK.
		if (AddrOp->isImm())
		continue;

		// Don't try to merge addresses that aren't either immediates or registers.
		// TODO: Should be possible to merge FrameIndexes and maybe some other
		// non-register
		if (!AddrOp->isReg())
		return false;

		// TODO: We should be able to merge physical reg addreses.
		if (Register::isPhysicalRegister(AddrOp->getReg()))
		return false;

		// If an address has only one use then there will be on other
		// instructions with the same address, so we can't merge this one.
		if (MRI.hasOneNonDBGUse(AddrOp->getReg()))
		return false;
		}
		return true;
		}

void setMI(MachineBasicBlock::iterator MI, const SIInstrInfo &TII,		void setMI(MachineBasicBlock::iterator MI, const SIInstrInfo &TII,
const GCNSubtarget &STM);		const GCNSubtarget &STM);
void setPaired(MachineBasicBlock::iterator MI, const SIInstrInfo &TII);		void setPaired(MachineBasicBlock::iterator MI, const SIInstrInfo &TII);
};		};

struct BaseRegisters {		struct BaseRegisters {
unsigned LoReg = 0;		unsigned LoReg = 0;
unsigned HiReg = 0;		unsigned HiReg = 0;
[43 lines not shown]	private:
Optional<int32_t> extractConstOffset(const MachineOperand &Op) const;		Optional<int32_t> extractConstOffset(const MachineOperand &Op) const;
void processBaseWithConstOffset(const MachineOperand &Base, MemAddress &Addr) const;		void processBaseWithConstOffset(const MachineOperand &Base, MemAddress &Addr) const;
/// Promotes constant offset to the immediate by adjusting the base. It		/// Promotes constant offset to the immediate by adjusting the base. It
/// tries to use a base from the nearby instructions that allows it to have		/// tries to use a base from the nearby instructions that allows it to have
/// a 13bit constant offset which gets promoted to the immediate.		/// a 13bit constant offset which gets promoted to the immediate.
bool promoteConstantOffsetToImm(MachineInstr &CI,		bool promoteConstantOffsetToImm(MachineInstr &CI,
MemInfoMap &Visited,		MemInfoMap &Visited,
SmallPtrSet<MachineInstr *, 4> &Promoted) const;		SmallPtrSet<MachineInstr *, 4> &Promoted) const;
		void addInstToMergeableList(const CombineInfo &CI,
		std::list<std::list<CombineInfo> > &MergeableInsts) const;
		bool collectMergeableInsts(MachineBasicBlock &MBB,
		std::list<std::list<CombineInfo> > &MergeableInsts) const;

public:		public:
static char ID;		static char ID;

SILoadStoreOptimizer() : MachineFunctionPass(ID) {		SILoadStoreOptimizer() : MachineFunctionPass(ID) {
initializeSILoadStoreOptimizerPass(*PassRegistry::getPassRegistry());		initializeSILoadStoreOptimizerPass(*PassRegistry::getPassRegistry());
}		}

bool optimizeBlock(MachineBasicBlock &MBB);		void removeCombinedInst(std::list<CombineInfo> &MergeList,
		const MachineInstr &MI);
		bool optimizeInstsWithSameBaseAddr(std::list<CombineInfo> &MergeList,
		bool &OptimizeListAgain);
		bool optimizeBlock(std::list<std::list<CombineInfo> > &MergeableInsts);

bool runOnMachineFunction(MachineFunction &MF) override;		bool runOnMachineFunction(MachineFunction &MF) override;

StringRef getPassName() const override { return "SI Load Store Optimizer"; }		StringRef getPassName() const override { return "SI Load Store Optimizer"; }

void getAnalysisUsage(AnalysisUsage &AU) const override {		void getAnalysisUsage(AnalysisUsage &AU) const override {
AU.setPreservesCFG();		AU.setPreservesCFG();
AU.addRequired<AAResultsWrapperPass>();		AU.addRequired<AAResultsWrapperPass>();
[179 lines not shown]	void SILoadStoreOptimizer::CombineInfo::setMI(MachineBasicBlock::iterator MI,
if (Regs & VADDR) {		if (Regs & VADDR) {
AddrOpName[NumAddresses++] = AMDGPU::OpName::vaddr;		AddrOpName[NumAddresses++] = AMDGPU::OpName::vaddr;
}		}

for (unsigned i = 0; i < NumAddresses; i++) {		for (unsigned i = 0; i < NumAddresses; i++) {
AddrIdx[i] = AMDGPU::getNamedOperandIdx(I->getOpcode(), AddrOpName[i]);		AddrIdx[i] = AMDGPU::getNamedOperandIdx(I->getOpcode(), AddrOpName[i]);
AddrReg[i] = &I->getOperand(AddrIdx[i]);		AddrReg[i] = &I->getOperand(AddrIdx[i]);
}		}

		InstsToMove.clear();
}		}

void SILoadStoreOptimizer::CombineInfo::setPaired(MachineBasicBlock::iterator MI,		void SILoadStoreOptimizer::CombineInfo::setPaired(MachineBasicBlock::iterator MI,
const SIInstrInfo &TII) {		const SIInstrInfo &TII) {
Paired = MI;		Paired = MI;
assert(InstClass == getInstClass(Paired->getOpcode(), TII));		assert(InstClass == getInstClass(Paired->getOpcode(), TII));
int OffsetIdx =		int OffsetIdx =
AMDGPU::getNamedOperandIdx(I->getOpcode(), AMDGPU::OpName::offset);		AMDGPU::getNamedOperandIdx(I->getOpcode(), AMDGPU::OpName::offset);
[200 lines not shown]	bool SILoadStoreOptimizer::findMatchingInst(CombineInfo &CI) {

const unsigned Opc = CI.I->getOpcode();		const unsigned Opc = CI.I->getOpcode();
const InstClassEnum InstClass = getInstClass(Opc, *TII);		const InstClassEnum InstClass = getInstClass(Opc, *TII);

if (InstClass == UNKNOWN) {		if (InstClass == UNKNOWN) {
return false;		return false;
}		}

for (unsigned i = 0; i < CI.NumAddresses; i++) {
// We only ever merge operations with the same base address register, so
// don't bother scanning forward if there are no other uses.
if (CI.AddrReg[i]->isReg() &&
(Register::isPhysicalRegister(CI.AddrReg[i]->getReg()) ||
MRI->hasOneNonDBGUse(CI.AddrReg[i]->getReg())))
return false;
}

++MBBI;		++MBBI;

DenseSet<unsigned> RegDefsToMove;		DenseSet<unsigned> RegDefsToMove;
DenseSet<unsigned> PhysRegUsesToMove;		DenseSet<unsigned> PhysRegUsesToMove;
addDefsUsesToList(*CI.I, RegDefsToMove, PhysRegUsesToMove);		addDefsUsesToList(*CI.I, RegDefsToMove, PhysRegUsesToMove);

for (; MBBI != E; ++MBBI) {		for (; MBBI != E; ++MBBI) {
const bool IsDS = (InstClass == DS_READ) || (InstClass == DS_WRITE);		const bool IsDS = (InstClass == DS_READ) || (InstClass == DS_WRITE);
[156 lines not shown]	BuildMI(*MBB, CI.Paired, DL, CopyDesc)
.add(*Dest0) // Copy to same destination including flags and sub reg.		.add(*Dest0) // Copy to same destination including flags and sub reg.
.addReg(DestReg, 0, SubRegIdx0);		.addReg(DestReg, 0, SubRegIdx0);
MachineInstr *Copy1 = BuildMI(*MBB, CI.Paired, DL, CopyDesc)		MachineInstr *Copy1 = BuildMI(*MBB, CI.Paired, DL, CopyDesc)
.add(*Dest1)		.add(*Dest1)
.addReg(DestReg, RegState::Kill, SubRegIdx1);		.addReg(DestReg, RegState::Kill, SubRegIdx1);

moveInstsAfter(Copy1, CI.InstsToMove);		moveInstsAfter(Copy1, CI.InstsToMove);

MachineBasicBlock::iterator Next = std::next(CI.I);
CI.I->eraseFromParent();		CI.I->eraseFromParent();
CI.Paired->eraseFromParent();		CI.Paired->eraseFromParent();

LLVM_DEBUG(dbgs() << "Inserted read2: " << *Read2 << '\n');		LLVM_DEBUG(dbgs() << "Inserted read2: " << *Read2 << '\n');
return Next;		return Read2;
}		}

unsigned SILoadStoreOptimizer::write2Opcode(unsigned EltSize) const {		unsigned SILoadStoreOptimizer::write2Opcode(unsigned EltSize) const {
if (STM->ldsRequiresM0Init())		if (STM->ldsRequiresM0Init())
return (EltSize == 4) ? AMDGPU::DS_WRITE2_B32 : AMDGPU::DS_WRITE2_B64;		return (EltSize == 4) ? AMDGPU::DS_WRITE2_B32 : AMDGPU::DS_WRITE2_B64;
return (EltSize == 4) ? AMDGPU::DS_WRITE2_B32_gfx9		return (EltSize == 4) ? AMDGPU::DS_WRITE2_B32_gfx9
: AMDGPU::DS_WRITE2_B64_gfx9;		: AMDGPU::DS_WRITE2_B64_gfx9;
}		}
[62 lines not shown]	MachineInstrBuilder Write2 =
.add(*Data1) // data1		.add(*Data1) // data1
.addImm(NewOffset0) // offset0		.addImm(NewOffset0) // offset0
.addImm(NewOffset1) // offset1		.addImm(NewOffset1) // offset1
.addImm(0) // gds		.addImm(0) // gds
.cloneMergedMemRefs({&*CI.I, &*CI.Paired});		.cloneMergedMemRefs({&*CI.I, &*CI.Paired});

moveInstsAfter(Write2, CI.InstsToMove);		moveInstsAfter(Write2, CI.InstsToMove);

MachineBasicBlock::iterator Next = std::next(CI.I);
CI.I->eraseFromParent();		CI.I->eraseFromParent();
CI.Paired->eraseFromParent();		CI.Paired->eraseFromParent();

LLVM_DEBUG(dbgs() << "Inserted write2 inst: " << *Write2 << '\n');		LLVM_DEBUG(dbgs() << "Inserted write2 inst: " << *Write2 << '\n');
return Next;		return Write2;
}		}

MachineBasicBlock::iterator		MachineBasicBlock::iterator
SILoadStoreOptimizer::mergeSBufferLoadImmPair(CombineInfo &CI) {		SILoadStoreOptimizer::mergeSBufferLoadImmPair(CombineInfo &CI) {
MachineBasicBlock *MBB = CI.I->getParent();		MachineBasicBlock *MBB = CI.I->getParent();
DebugLoc DL = CI.I->getDebugLoc();		DebugLoc DL = CI.I->getDebugLoc();
const unsigned Opcode = getNewOpcode(CI);		const unsigned Opcode = getNewOpcode(CI);

const TargetRegisterClass *SuperRC = getTargetRegisterClass(CI);		const TargetRegisterClass *SuperRC = getTargetRegisterClass(CI);

unsigned DestReg = MRI->createVirtualRegister(SuperRC);		unsigned DestReg = MRI->createVirtualRegister(SuperRC);
unsigned MergedOffset = std::min(CI.Offset0, CI.Offset1);		unsigned MergedOffset = std::min(CI.Offset0, CI.Offset1);

// It shouldn't be possible to get this far if the two instructions		// It shouldn't be possible to get this far if the two instructions
// don't have a single memoperand, because MachineInstr::mayAlias()		// don't have a single memoperand, because MachineInstr::mayAlias()
// will return true if this is the case.		// will return true if this is the case.
assert(CI.I->hasOneMemOperand() && CI.Paired->hasOneMemOperand());		assert(CI.I->hasOneMemOperand() && CI.Paired->hasOneMemOperand());

const MachineMemOperand *MMOa = *CI.I->memoperands_begin();		const MachineMemOperand *MMOa = *CI.I->memoperands_begin();
const MachineMemOperand *MMOb = *CI.Paired->memoperands_begin();		const MachineMemOperand *MMOb = *CI.Paired->memoperands_begin();

		MachineInstr *New =
BuildMI(*MBB, CI.Paired, DL, TII->get(Opcode), DestReg)		BuildMI(*MBB, CI.Paired, DL, TII->get(Opcode), DestReg)
.add(*TII->getNamedOperand(*CI.I, AMDGPU::OpName::sbase))		.add(*TII->getNamedOperand(*CI.I, AMDGPU::OpName::sbase))
.addImm(MergedOffset) // offset		.addImm(MergedOffset) // offset
.addImm(CI.GLC0) // glc		.addImm(CI.GLC0) // glc
.addImm(CI.DLC0) // dlc		.addImm(CI.DLC0) // dlc
.addMemOperand(combineKnownAdjacentMMOs(*MBB->getParent(), MMOa, MMOb));		.addMemOperand(combineKnownAdjacentMMOs(*MBB->getParent(), MMOa, MMOb));

std::pair<unsigned, unsigned> SubRegIdx = getSubRegIdxs(CI);		std::pair<unsigned, unsigned> SubRegIdx = getSubRegIdxs(CI);
const unsigned SubRegIdx0 = std::get<0>(SubRegIdx);		const unsigned SubRegIdx0 = std::get<0>(SubRegIdx);
const unsigned SubRegIdx1 = std::get<1>(SubRegIdx);		const unsigned SubRegIdx1 = std::get<1>(SubRegIdx);

// Copy to the old destination registers.		// Copy to the old destination registers.
const MCInstrDesc &CopyDesc = TII->get(TargetOpcode::COPY);		const MCInstrDesc &CopyDesc = TII->get(TargetOpcode::COPY);
const auto *Dest0 = TII->getNamedOperand(*CI.I, AMDGPU::OpName::sdst);		const auto *Dest0 = TII->getNamedOperand(*CI.I, AMDGPU::OpName::sdst);
const auto *Dest1 = TII->getNamedOperand(*CI.Paired, AMDGPU::OpName::sdst);		const auto *Dest1 = TII->getNamedOperand(*CI.Paired, AMDGPU::OpName::sdst);

BuildMI(*MBB, CI.Paired, DL, CopyDesc)		BuildMI(*MBB, CI.Paired, DL, CopyDesc)
.add(*Dest0) // Copy to same destination including flags and sub reg.		.add(*Dest0) // Copy to same destination including flags and sub reg.
.addReg(DestReg, 0, SubRegIdx0);		.addReg(DestReg, 0, SubRegIdx0);
MachineInstr *Copy1 = BuildMI(*MBB, CI.Paired, DL, CopyDesc)		MachineInstr *Copy1 = BuildMI(*MBB, CI.Paired, DL, CopyDesc)
.add(*Dest1)		.add(*Dest1)
.addReg(DestReg, RegState::Kill, SubRegIdx1);		.addReg(DestReg, RegState::Kill, SubRegIdx1);

moveInstsAfter(Copy1, CI.InstsToMove);		moveInstsAfter(Copy1, CI.InstsToMove);

MachineBasicBlock::iterator Next = std::next(CI.I);
CI.I->eraseFromParent();		CI.I->eraseFromParent();
CI.Paired->eraseFromParent();		CI.Paired->eraseFromParent();
return Next;		return New;
}		}

MachineBasicBlock::iterator		MachineBasicBlock::iterator
SILoadStoreOptimizer::mergeBufferLoadPair(CombineInfo &CI) {		SILoadStoreOptimizer::mergeBufferLoadPair(CombineInfo &CI) {
MachineBasicBlock *MBB = CI.I->getParent();		MachineBasicBlock *MBB = CI.I->getParent();
DebugLoc DL = CI.I->getDebugLoc();		DebugLoc DL = CI.I->getDebugLoc();

const unsigned Opcode = getNewOpcode(CI);		const unsigned Opcode = getNewOpcode(CI);
[14 lines not shown]	SILoadStoreOptimizer::mergeBufferLoadPair(CombineInfo &CI) {
// It shouldn't be possible to get this far if the two instructions		// It shouldn't be possible to get this far if the two instructions
// don't have a single memoperand, because MachineInstr::mayAlias()		// don't have a single memoperand, because MachineInstr::mayAlias()
// will return true if this is the case.		// will return true if this is the case.
assert(CI.I->hasOneMemOperand() && CI.Paired->hasOneMemOperand());		assert(CI.I->hasOneMemOperand() && CI.Paired->hasOneMemOperand());

const MachineMemOperand *MMOa = *CI.I->memoperands_begin();		const MachineMemOperand *MMOa = *CI.I->memoperands_begin();
const MachineMemOperand *MMOb = *CI.Paired->memoperands_begin();		const MachineMemOperand *MMOb = *CI.Paired->memoperands_begin();

		MachineInstr *New =
MIB.add(*TII->getNamedOperand(*CI.I, AMDGPU::OpName::srsrc))		MIB.add(*TII->getNamedOperand(*CI.I, AMDGPU::OpName::srsrc))
.add(*TII->getNamedOperand(*CI.I, AMDGPU::OpName::soffset))		.add(*TII->getNamedOperand(*CI.I, AMDGPU::OpName::soffset))
.addImm(MergedOffset) // offset		.addImm(MergedOffset) // offset
.addImm(CI.GLC0) // glc		.addImm(CI.GLC0) // glc
.addImm(CI.SLC0) // slc		.addImm(CI.SLC0) // slc
.addImm(0) // tfe		.addImm(0) // tfe
.addImm(CI.DLC0) // dlc		.addImm(CI.DLC0) // dlc
.addMemOperand(combineKnownAdjacentMMOs(*MBB->getParent(), MMOa, MMOb));		.addMemOperand(combineKnownAdjacentMMOs(*MBB->getParent(), MMOa, MMOb));

std::pair<unsigned, unsigned> SubRegIdx = getSubRegIdxs(CI);		std::pair<unsigned, unsigned> SubRegIdx = getSubRegIdxs(CI);
const unsigned SubRegIdx0 = std::get<0>(SubRegIdx);		const unsigned SubRegIdx0 = std::get<0>(SubRegIdx);
const unsigned SubRegIdx1 = std::get<1>(SubRegIdx);		const unsigned SubRegIdx1 = std::get<1>(SubRegIdx);

// Copy to the old destination registers.		// Copy to the old destination registers.
const MCInstrDesc &CopyDesc = TII->get(TargetOpcode::COPY);		const MCInstrDesc &CopyDesc = TII->get(TargetOpcode::COPY);
const auto *Dest0 = TII->getNamedOperand(*CI.I, AMDGPU::OpName::vdata);		const auto *Dest0 = TII->getNamedOperand(*CI.I, AMDGPU::OpName::vdata);
const auto *Dest1 = TII->getNamedOperand(*CI.Paired, AMDGPU::OpName::vdata);		const auto *Dest1 = TII->getNamedOperand(*CI.Paired, AMDGPU::OpName::vdata);

BuildMI(*MBB, CI.Paired, DL, CopyDesc)		BuildMI(*MBB, CI.Paired, DL, CopyDesc)
.add(*Dest0) // Copy to same destination including flags and sub reg.		.add(*Dest0) // Copy to same destination including flags and sub reg.
.addReg(DestReg, 0, SubRegIdx0);		.addReg(DestReg, 0, SubRegIdx0);
MachineInstr *Copy1 = BuildMI(*MBB, CI.Paired, DL, CopyDesc)		MachineInstr *Copy1 = BuildMI(*MBB, CI.Paired, DL, CopyDesc)
.add(*Dest1)		.add(*Dest1)
.addReg(DestReg, RegState::Kill, SubRegIdx1);		.addReg(DestReg, RegState::Kill, SubRegIdx1);

moveInstsAfter(Copy1, CI.InstsToMove);		moveInstsAfter(Copy1, CI.InstsToMove);

MachineBasicBlock::iterator Next = std::next(CI.I);
CI.I->eraseFromParent();		CI.I->eraseFromParent();
CI.Paired->eraseFromParent();		CI.Paired->eraseFromParent();
return Next;		return New;
}		}

unsigned SILoadStoreOptimizer::getNewOpcode(const CombineInfo &CI) {		unsigned SILoadStoreOptimizer::getNewOpcode(const CombineInfo &CI) {
const unsigned Width = CI.Width0 + CI.Width1;		const unsigned Width = CI.Width0 + CI.Width1;

switch (CI.InstClass) {		switch (CI.InstClass) {
default:		default:
return AMDGPU::getMUBUFOpcode(CI.InstClass, Width);		return AMDGPU::getMUBUFOpcode(CI.InstClass, Width);
[145 lines not shown]	SILoadStoreOptimizer::mergeBufferStorePair(CombineInfo &CI) {
// It shouldn't be possible to get this far if the two instructions		// It shouldn't be possible to get this far if the two instructions
// don't have a single memoperand, because MachineInstr::mayAlias()		// don't have a single memoperand, because MachineInstr::mayAlias()
// will return true if this is the case.		// will return true if this is the case.
assert(CI.I->hasOneMemOperand() && CI.Paired->hasOneMemOperand());		assert(CI.I->hasOneMemOperand() && CI.Paired->hasOneMemOperand());

const MachineMemOperand *MMOa = *CI.I->memoperands_begin();		const MachineMemOperand *MMOa = *CI.I->memoperands_begin();
const MachineMemOperand *MMOb = *CI.Paired->memoperands_begin();		const MachineMemOperand *MMOb = *CI.Paired->memoperands_begin();

		MachineInstr *New =
MIB.add(*TII->getNamedOperand(*CI.I, AMDGPU::OpName::srsrc))		MIB.add(*TII->getNamedOperand(*CI.I, AMDGPU::OpName::srsrc))
.add(*TII->getNamedOperand(*CI.I, AMDGPU::OpName::soffset))		.add(*TII->getNamedOperand(*CI.I, AMDGPU::OpName::soffset))
.addImm(std::min(CI.Offset0, CI.Offset1)) // offset		.addImm(std::min(CI.Offset0, CI.Offset1)) // offset
.addImm(CI.GLC0) // glc		.addImm(CI.GLC0) // glc
.addImm(CI.SLC0) // slc		.addImm(CI.SLC0) // slc
.addImm(0) // tfe		.addImm(0) // tfe
.addImm(CI.DLC0) // dlc		.addImm(CI.DLC0) // dlc
.addMemOperand(combineKnownAdjacentMMOs(*MBB->getParent(), MMOa, MMOb));		.addMemOperand(combineKnownAdjacentMMOs(*MBB->getParent(), MMOa, MMOb));

moveInstsAfter(MIB, CI.InstsToMove);		moveInstsAfter(MIB, CI.InstsToMove);

MachineBasicBlock::iterator Next = std::next(CI.I);
CI.I->eraseFromParent();		CI.I->eraseFromParent();
CI.Paired->eraseFromParent();		CI.Paired->eraseFromParent();
return Next;		return New;
}		}

MachineOperand		MachineOperand
SILoadStoreOptimizer::createRegOrImm(int32_t Val, MachineInstr &MI) const {		SILoadStoreOptimizer::createRegOrImm(int32_t Val, MachineInstr &MI) const {
APInt V(32, Val, true);		APInt V(32, Val, true);
if (TII->isInlineConstant(V))		if (TII->isInlineConstant(V))
return MachineOperand::CreateImm(Val);		return MachineOperand::CreateImm(Val);

[295 lines not shown]	if (AnchorInst) {
}		}
AnchorList.insert(AnchorInst);		AnchorList.insert(AnchorInst);
return true;		return true;
}		}

return false;		return false;
}		}

// Scan through looking for adjacent LDS operations with constant offsets from		void SILoadStoreOptimizer::addInstToMergeableList(const CombineInfo &CI,
// the same base register. We rely on the scheduler to do the hard work of		std::list<std::list<CombineInfo> > &MergeableInsts) const {
// clustering nearby loads, and assume these are all adjacent.		for (std::list<CombineInfo> &AddrList : MergeableInsts) {
bool SILoadStoreOptimizer::optimizeBlock(MachineBasicBlock &MBB) {		if (AddrList.front().hasSameBaseAddress(*CI.I) &&
bool Modified = false;		AddrList.front().InstClass == CI.InstClass) {
		AddrList.emplace_back(CI);
		return;
		}
		}

		// Base address not found, so add a new list.
		MergeableInsts.emplace_back(1, CI);
		}

		bool SILoadStoreOptimizer::collectMergeableInsts(MachineBasicBlock &MBB,
		std::list<std::list<CombineInfo> > &MergeableInsts) const {
		bool Modified = false;
// Contain the list		// Contain the list
MemInfoMap Visited;		MemInfoMap Visited;
// Contains the list of instructions for which constant offsets are being		// Contains the list of instructions for which constant offsets are being
// promoted to the IMM.		// promoted to the IMM.
SmallPtrSet<MachineInstr *, 4> AnchorList;		SmallPtrSet<MachineInstr *, 4> AnchorList;

for (MachineBasicBlock::iterator I = MBB.begin(), E = MBB.end(); I != E;) {		// Sort potential mergeable instructions into lists. One list per base address.
MachineInstr &MI = *I;		for (MachineInstr &MI : MBB.instrs()) {
		// We run this before checking if an address is mergeable, because it can produce
		// better code even if the instructions aren't mergeable.
if (promoteConstantOffsetToImm(MI, Visited, AnchorList))		if (promoteConstantOffsetToImm(MI, Visited, AnchorList))
Modified = true;		Modified = true;

		const InstClassEnum InstClass = getInstClass(MI.getOpcode(), *TII);
		if (InstClass == UNKNOWN)
		continue;

// Don't combine if volatile.		// Don't combine if volatile.
if (MI.hasOrderedMemoryRef()) {		if (MI.hasOrderedMemoryRef())
++I;
continue;		continue;
}

CombineInfo CI;		CombineInfo CI;
CI.setMI(I, *TII, *STM);		CI.setMI(MI, *TII, *STM);

		if (!CI.hasMergeableAddress(*MRI))
		continue;

		addInstToMergeableList(CI, MergeableInsts);
		}
		return Modified;
		}

		// Scan through looking for adjacent LDS operations with constant offsets from
		// the same base register. We rely on the scheduler to do the hard work of
		// clustering nearby loads, and assume these are all adjacent.
		bool SILoadStoreOptimizer::optimizeBlock(
		std::list<std::list<CombineInfo> > &MergeableInsts) {
		bool Modified = false;

		for (std::list<CombineInfo> &MergeList : MergeableInsts) {
		if (MergeList.size() < 2)
		continue;

		bool OptimizeListAgain = false;
		if (!optimizeInstsWithSameBaseAddr(MergeList, OptimizeListAgain)) {
		// We weren't able to make any changes, so clear the list so we don't
		// process the same instructions the next time we try to optimize this
		// block.
		MergeList.clear();
		continue;
		}

		// We made changes, but also determined that there were no more optimization
		// opportunities, so we don't need to reprocess the list
		if (!OptimizeListAgain)
		MergeList.clear();

		OptimizeAgain |= OptimizeListAgain;
		Modified = true;
		}
		return Modified;
		}

		void
		SILoadStoreOptimizer::removeCombinedInst(std::list<CombineInfo> &MergeList,
		const MachineInstr &MI) {

		for (auto CI = MergeList.begin(), E = MergeList.end(); CI != E; ++CI) {
		if (&*CI->I == &MI) {
		MergeList.erase(CI);
		return;
		}
		}
		}

		bool
		SILoadStoreOptimizer::optimizeInstsWithSameBaseAddr(
		std::list<CombineInfo> &MergeList,
		bool &OptimizeListAgain) {
		bool Modified = false;
		for (auto I = MergeList.begin(); I != MergeList.end(); ++I) {
		CombineInfo &CI = *I;

switch (CI.InstClass) {		switch (CI.InstClass) {
default:		default:
break;		break;
case DS_READ:		case DS_READ:
if (findMatchingInst(CI)) {		if (findMatchingInst(CI)) {
Modified = true;		Modified = true;
I = mergeRead2Pair(CI);		removeCombinedInst(MergeList, *CI.Paired);
} else {		MachineBasicBlock::iterator NewMI = mergeRead2Pair(CI);
++I;		CI.setMI(NewMI, *TII, *STM);
}		}
continue;		break;
case DS_WRITE:		case DS_WRITE:
if (findMatchingInst(CI)) {		if (findMatchingInst(CI)) {
Modified = true;		Modified = true;
I = mergeWrite2Pair(CI);		removeCombinedInst(MergeList, *CI.Paired);
} else {		MachineBasicBlock::iterator NewMI = mergeWrite2Pair(CI);
++I;		CI.setMI(NewMI, *TII, *STM);
}		}
continue;		break;
case S_BUFFER_LOAD_IMM:		case S_BUFFER_LOAD_IMM:
if (findMatchingInst(CI)) {		if (findMatchingInst(CI)) {
Modified = true;		Modified = true;
I = mergeSBufferLoadImmPair(CI);		removeCombinedInst(MergeList, *CI.Paired);
OptimizeAgain |= (CI.Width0 + CI.Width1) < 16;		MachineBasicBlock::iterator NewMI = mergeSBufferLoadImmPair(CI);
} else {		CI.setMI(NewMI, *TII, *STM);
++I;		OptimizeListAgain |= (CI.Width0 + CI.Width1) < 16;
}		}
continue;		break;
case BUFFER_LOAD_OFFEN:		case BUFFER_LOAD_OFFEN:
case BUFFER_LOAD_OFFSET:		case BUFFER_LOAD_OFFSET:
case BUFFER_LOAD_OFFEN_exact:		case BUFFER_LOAD_OFFEN_exact:
case BUFFER_LOAD_OFFSET_exact:		case BUFFER_LOAD_OFFSET_exact:
if (findMatchingInst(CI)) {		if (findMatchingInst(CI)) {
Modified = true;		Modified = true;
I = mergeBufferLoadPair(CI);		removeCombinedInst(MergeList, *CI.Paired);
OptimizeAgain |= (CI.Width0 + CI.Width1) < 4;		MachineBasicBlock::iterator NewMI = mergeBufferLoadPair(CI);
} else {		CI.setMI(NewMI, *TII, *STM);
++I;		OptimizeListAgain |= (CI.Width0 + CI.Width1) < 4;
}		}
continue;		break;
case BUFFER_STORE_OFFEN:		case BUFFER_STORE_OFFEN:
case BUFFER_STORE_OFFSET:		case BUFFER_STORE_OFFSET:
case BUFFER_STORE_OFFEN_exact:		case BUFFER_STORE_OFFEN_exact:
case BUFFER_STORE_OFFSET_exact:		case BUFFER_STORE_OFFSET_exact:
if (findMatchingInst(CI)) {		if (findMatchingInst(CI)) {
Modified = true;		Modified = true;
I = mergeBufferStorePair(CI);		removeCombinedInst(MergeList, *CI.Paired);
OptimizeAgain |= (CI.Width0 + CI.Width1) < 4;		MachineBasicBlock::iterator NewMI = mergeBufferStorePair(CI);
} else {		CI.setMI(NewMI, *TII, *STM);
++I;		OptimizeListAgain |= (CI.Width0 + CI.Width1) < 4;
}		}
continue;		break;
}		}
		// Clear the InstsToMove after we have finished searching so we don't have
++I;		// stale values left over if we search for this CI again in another pass
		// over the block.
		CI.InstsToMove.clear();
}		}

return Modified;		return Modified;
}		}

bool SILoadStoreOptimizer::runOnMachineFunction(MachineFunction &MF) {		bool SILoadStoreOptimizer::runOnMachineFunction(MachineFunction &MF) {
if (skipFunction(MF.getFunction()))		if (skipFunction(MF.getFunction()))
return false;		return false;
[9 lines not shown]	bool SILoadStoreOptimizer::runOnMachineFunction(MachineFunction &MF) {
AA = &getAnalysis<AAResultsWrapperPass>().getAAResults();		AA = &getAnalysis<AAResultsWrapperPass>().getAAResults();

assert(MRI->isSSA() && "Must be run on SSA");		assert(MRI->isSSA() && "Must be run on SSA");

LLVM_DEBUG(dbgs() << "Running SILoadStoreOptimizer\n");		LLVM_DEBUG(dbgs() << "Running SILoadStoreOptimizer\n");

bool Modified = false;		bool Modified = false;


for (MachineBasicBlock &MBB : MF) {		for (MachineBasicBlock &MBB : MF) {
		std::list<std::list<CombineInfo> > MergeableInsts;
		// First pass: Collect list of all instructions we know how to merge.
		Modified |= collectMergeableInsts(MBB, MergeableInsts);
do {		do {
OptimizeAgain = false;		OptimizeAgain = false;
Modified |= optimizeBlock(MBB);		Modified |= optimizeBlock(MergeableInsts);
} while (OptimizeAgain);		} while (OptimizeAgain);
}		}

return Modified;		return Modified;
}		}