In many cases DS operations can be combined even if their offsets do not fit into the 8-bit encoding. All it takes is adjusting the base address.
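
For illustration (these numbers come from the ds_read2 example discussed later in this review, not from the patch text): ds_read2_b32 encodes its two offsets as 8-bit counts of 4-byte elements, so it can only reach 255 * 4 = 1020 bytes past the base pointer. Two loads at base+2400 and base+2800 therefore cannot be paired directly, but after a single v_add_i32 of 0x960 (2400) to the pointer they combine into one ds_read2_b32 with offset0:0 and offset1:100.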
Diff Detail

- Repository: rL LLVM

Event Timeline
lib/Target/AMDGPU/SILoadStoreOptimizer.cpp, line 78:

Why 3 offsets?

lib/Target/AMDGPU/SILoadStoreOptimizer.cpp, line 78:

Two offsets, as they will be encoded into an LDS instruction, and then a base offset which needs to be added to the pointer with v_add_i32 if it is non-zero.
lib/Target/AMDGPU/SILoadStoreOptimizer.cpp, line 78:

But one of them would become zero in the instruction?
lib/Target/AMDGPU/SILoadStoreOptimizer.cpp, line 78:

For now, yes: either Offset0 or BaseOff is zero. In the longer term that is not necessarily so. Imagine you could add less to the base pointer, use both offsets in the encoding, and then reuse the new base register for another pair, where this code would otherwise generate a separate v_add_i32.
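
To make the decomposition concrete, here is a minimal, self-contained sketch of the arithmetic being described. It is an illustration only, not the code under review; the function and variable names (decomposeOffsets, EltSize, EncOff0, and so on) are made up for this example, and the actual pass may choose the base offset differently.

```cpp
#include <cstdint>

// Illustration only. Split two byte offsets from a common base pointer into:
//   BaseOff          - amount to add to the pointer with a separate VALU add
//   EncOff0, EncOff1 - 8-bit offsets, in units of EltSize, for the merged
//                      ds_read2/ds_write2 instruction
// Returns false if the pair still cannot be encoded.
static bool decomposeOffsets(uint32_t Off0, uint32_t Off1, uint32_t EltSize,
                             uint32_t &BaseOff, uint8_t &EncOff0,
                             uint8_t &EncOff1) {
  // The encoded offsets are scaled by the element size, so both byte
  // offsets must be multiples of it.
  if (Off0 % EltSize != 0 || Off1 % EltSize != 0)
    return false;

  // Simplest choice: fold the smaller offset into the base, so the
  // instruction only has to encode the distance between the two accesses.
  // (A smarter choice could leave both encoded offsets non-zero so the
  // adjusted base can be reused by further pairs, as suggested above.)
  BaseOff = Off0 < Off1 ? Off0 : Off1;
  uint32_t R0 = (Off0 - BaseOff) / EltSize;
  uint32_t R1 = (Off1 - BaseOff) / EltSize;
  if (R0 > 255 || R1 > 255)
    return false; // stride is still too large for the 8-bit fields

  EncOff0 = static_cast<uint8_t>(R0);
  EncOff1 = static_cast<uint8_t>(R1);
  return true;
}
// e.g. Off0 = 2400, Off1 = 2800, EltSize = 4
//      -> BaseOff = 2400 (0x960), EncOff0 = 0, EncOff1 = 100
```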
lib/Target/AMDGPU/SILoadStoreOptimizer.cpp, line 78:

Right.
llvm/trunk/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp (On Diff #95170):

- Lines 387–389: This should use the e64 version with an unused carry. We should add a helper to TII to emit this, since it will change with GFX9. (See the sketch after this list.)
- Line 392: Put BuildMI on the next line and shift the .add calls to the left.
- Lines 397–398: Use .setMemRefs (also covered in the sketch below).
- Lines 459–460: Same as with the other add.
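
A rough sketch of how the two suggestions above might look when building the merged pair. This is an illustration against the generic MachineInstrBuilder API, not the patch itself; MBB, InsertPt, DL, TII, MRI, Opc, DestReg, BaseReg, ImmReg, EncOff0/EncOff1, and CI stand in for values the pass would already have, and the exact operand list of V_ADD_I32_e64 may differ by LLVM revision and subtarget.

```cpp
// Illustration only: use the e64 encoding of the add so the carry-out lands
// in a dead virtual SGPR pair instead of clobbering VCC, and propagate the
// original memory operands onto the merged ds_read2.
unsigned CarryReg = MRI->createVirtualRegister(&AMDGPU::SReg_64RegClass);
unsigned NewBase  = MRI->createVirtualRegister(&AMDGPU::VGPR_32RegClass);

BuildMI(*MBB, InsertPt, DL, TII->get(AMDGPU::V_ADD_I32_e64), NewBase)
    .addReg(CarryReg, RegState::Define | RegState::Dead) // unused carry-out
    .addReg(ImmReg)   // BaseOff materialized into a register (VOP3 takes no literal)
    .addReg(BaseReg); // original ds address

BuildMI(*MBB, InsertPt, DL, TII->get(Opc), DestReg)
    .addReg(NewBase)
    .addImm(EncOff0) // offset0
    .addImm(EncOff1) // offset1
    .addImm(0)       // gds
    .setMemRefs(CI.I->mergeMemRefsWith(*CI.Paired));
```

Wrapping the add in a TII helper, as suggested, would presumably let GFX9 pick the no-carry VOP3 add instead without touching this code.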
llvm/trunk/test/CodeGen/AMDGPU/ds-combine-large-stride.ll (On Diff #95170):

- Line 1: -march is redundant with the triple. This should also use the GCN check prefix, since the check lines will need to change for GFX9 (and a GFX9 run line should be added so this breaks when the add change is made).
- Line 370: These all need to be marked amdgpu_kernel.
http://lab.llvm.org:8011/builders/sanitizer-ppc64be-linux/builds/2283 is broken by this change.
llvm/trunk/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp, lines 387–389 (On Diff #95170):

Matt, I doubt we should use the e64 version here. It does not accept an immediate, which would effectively require one more SGPR and one more mov. VCC thrashing seems to be the lesser issue.
llvm/trunk/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp, lines 387–389 (On Diff #95170):

Even the new no-carry variant is VOP3, so it has the same issue.
llvm/trunk/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp, lines 387–389 (On Diff #95170):

You can materialize the constant in a register. It will be folded and shrunk later.
llvm/trunk/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp, lines 387–389 (On Diff #95170):

It is not folded, though:

    s_movk_i32    s0, 0x960
    v_add_i32_e32 v0, vcc, s0, v4
    ds_read2_b32  v[0:1], v0 offset1:100

Also note that the values in this case will mostly be unique, because they are added to the base pointer, not to a previously incremented pointer. In the latter case there would be a high probability that many adds share the same literal. I would lean towards using the e64 version and an SGPR once the pass is redesigned to first collect all pairs, build a chain of adds, and only then combine. In that case we would use a single SGPR.