This is an archive of the discontinued LLVM Phabricator instance.

Differential D37698

Allow target to decide when to cluster loads/stores in misched
ClosedPublic

Authored by rampitec on Sep 11 2017, 10:30 AM.

Details

Reviewers

vpykhtin
junbuml
MatzeB

Commits

rG7fe9a5d9b423: Allow target to decide when to cluster loads/stores in misched
rL313208: Allow target to decide when to cluster loads/stores in misched

Summary

When clustering loads or stores, MachineScheduler checks whether their
base pointers point to the same memory. It does this by comparing the
base registers of the two memory instructions. This works when the
instructions have a separate offset operand. If they instead require a
fully calculated pointer, this logic can never cluster them.

Changed shouldClusterMemOps to accept the base registers as well and
let it decide what to do with them.
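
The limitation described in the summary can be illustrated with a small toy model. This is a minimal sketch and not LLVM code: `ToyMemOp`, `sameBaseReg`, and `sameBasePtr` are hypothetical names, and the `Object` field is an assumption standing in for whatever a target can learn from an instruction's memory operand.

```cpp
#include <cassert>
#include <cstdint>

// Toy stand-in for a memory instruction; not an LLVM type.
struct ToyMemOp {
  unsigned BaseReg;   // register holding the base (or the full address)
  int64_t Offset;     // immediate offset; 0 for "full pointer" encodings
  const void *Object; // underlying object, if known (assumption of this sketch)
};

// Register-only rule: a pair is a clustering candidate only if both
// instructions name the same base register.
inline bool sameBaseReg(const ToyMemOp &A, const ToyMemOp &B) {
  return A.BaseReg == B.BaseReg;
}

// Relaxed rule in the spirit of this patch: equal registers are enough,
// otherwise fall back to comparing the known underlying objects.
inline bool sameBasePtr(const ToyMemOp &A, const ToyMemOp &B) {
  if (A.BaseReg == B.BaseReg)
    return true;
  return A.Object != nullptr && A.Object == B.Object;
}
```

For two accesses whose addresses live in different registers but refer to the same object, `sameBaseReg` rejects the pair while `sameBasePtr` accepts it.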

Diff Detail

Repository: rL LLVM

Event Timeline

rampitec created this revision. Sep 11 2017, 10:30 AM

Herald added subscribers: javed.absar, nhaehnle, arsenm. Sep 11 2017, 10:30 AM

I'd suggest changing the name to doMemOpsHaveSameBase[Ptr]

Renamed the callback as suggested by Brian.
Cleaned up the test.

arsenm added inline comments. Sep 11 2017, 11:03 AM

include/llvm/Target/TargetInstrInfo.h
1094–1095 ↗	(On Diff #114638)	This needs more clarification on how it's different from getMemOpBaseRegImmOfs.

Updated comment as suggested.

hfinkel added a subscriber: hfinkel. Sep 11 2017, 7:53 PM

hfinkel added inline comments.

lib/Target/AMDGPU/SIInstrInfo.cpp
360 ↗	(On Diff #114639)	Can you make this the default implementation?

rampitec added inline comments. Sep 11 2017, 8:44 PM

lib/Target/AMDGPU/SIInstrInfo.cpp
360 ↗	(On Diff #114639)	This will result in both improvements and regressions across the board, as it will dramatically increase the amount of clustering. Having no access to all HW, benchmark suites, and targets, I left the default behavior intact. In particular, I had to limit clustering in amdgpu below to maintain reasonable average performance, which I cannot do for all other targets. I.e. I would be more than happy to make it the default behavior, but I think many more people would have to agree to spend their time tuning it afterwards for their targets. What we can do is change the default if this is eventually cloned to the majority of targets requiring clustering. What do others think? Is this a generally desired default behaviour?

hfinkel added inline comments. Sep 11 2017, 9:12 PM

lib/Target/AMDGPU/SIInstrInfo.cpp
360 ↗	(On Diff #114639)	If we generally adopted your current methodology, then we'd very rarely get any improvements to the common components. We need to do things that generally make sense. People are encouraged to benchmark them on a variety of targets, and we'll often give them a heads up (on llvm-dev and/or pinging target code owners), but otherwise, they'll notice post-commit if things regress. That having been said... As currently defined by this patch, this implementation is how I'd recommend implementing the target-independent code for the callback. The question is then: is this the right callback to add? Is it too general? Do you think that, some tuning aside, this is the right thing to do on most/all relevant targets? Do we actually want this callback or do we want just a Boolean switch (this behavior vs. the old behavior)? [I can certainly imagine hardware store combining where the base register actually matters, and I can also imagine hardware where the address is all that matters, but I'd assume most modern hardware wants the latter.]

rampitec added inline comments. Sep 11 2017, 9:33 PM

lib/Target/AMDGPU/SIInstrInfo.cpp
360 ↗	(On Diff #114639)	In general there are several questionable things about it if we generalize it for all targets:
- It does not respect offset distance, which may be important. I.e. some targets may only want it if they hit the same cacheline. This does not apply to amdgpu because we always want to form a so-called "memory clause", a batch of memory instructions.
- It is not the fastest code out there, as it uses GetUnderlyingObject.
- Some targets may still want to implement it for the sake of further combining loads, but by relying on memory operands that may not be generally possible. It is unclear whether all targets maintain reasonable memory operands for this to be useful; however, it still compares base registers as its first resort.
None of that prevents it from being the default implementation, though, if it can be overridden.
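
The offset-distance concern raised in this comment can be sketched as a predicate that accepts a pair only when both offsets land in the same cache line. This is a hypothetical illustration, not part of the patch: `floorDiv` and `hitSameCacheLine` are made-up names, and the sketch assumes both offsets are relative to the same line-aligned base.

```cpp
#include <cassert>
#include <cstdint>

// Floor division for a positive divisor, so that negative offsets map to
// the lower line index instead of being rounded toward zero.
inline int64_t floorDiv(int64_t Value, int64_t Divisor) {
  assert(Divisor > 0 && "line size must be positive");
  int64_t Quotient = Value / Divisor;
  if (Value % Divisor != 0 && Value < 0)
    --Quotient;
  return Quotient;
}

// Hypothetical rule: two accesses off the same base are worth clustering
// only if their offsets fall into the same cache line.
inline bool hitSameCacheLine(int64_t Offset1, int64_t Offset2,
                             int64_t LineBytes) {
  return floorDiv(Offset1, LineBytes) == floorDiv(Offset2, LineBytes);
}
```

A target that always wants to batch memory instructions, as described for amdgpu, would simply not apply such a filter.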

SIInstrInfo::doMemOpsHaveSameBasePtr looks generic enough to be default indeed but should we commit this first and then make doMemOpsHaveSameBasePtr default so we could rollback to this one in case of severe regressions? LGTM by the way.

This revision is now accepted and ready to land. Sep 12 2017, 9:07 AM

In D37698#868147, @vpykhtin wrote:

SIInstrInfo::doMemOpsHaveSameBasePtr looks generic enough to be default indeed but should we commit this first and then make doMemOpsHaveSameBasePtr default so we could rollback to this one in case of severe regressions? LGTM by the way.

Good point. I will create a separate patch to switch the default implementation.

rampitec mentioned this in D37755: Change default implementation of doMemOpsHaveSameBasePtr. Sep 12 2017, 12:09 PM

rampitec added a child revision: D37755: Change default implementation of doMemOpsHaveSameBasePtr. Sep 12 2017, 12:11 PM

rampitec marked 4 inline comments as done.

rampitec added inline comments.

lib/Target/AMDGPU/SIInstrInfo.cpp
360 ↗	(On Diff #114639)	D37755 moves AMDGPU implementation into core.

hfinkel added inline comments. Sep 13 2017, 2:05 AM

lib/Target/AMDGPU/SIInstrInfo.cpp
360 ↗	(On Diff #114639)	To comment on the conditions:
> In general there are several questionable things about it if we generalize it for all targets: It does not respect offset distance, which may be important. I.e. some targets may only want it if they hit the same cacheline. It does not apply to amdgpu because we always want to form so called "memory clause", a batch of memory instructions.
This is fair, although the current clustering logic does not consider this.
> It is not a fastest code out there as it uses GetUnderlyingObject.
We already essentially call this once per MMO-carrying instruction in ScheduleDAGInstrs.cpp. Doing it again here does not seem so bad. Unless we're now doing it O(N^2) times. Are we?
> Some targets may still want to implement it for the sake of further combining loads, but relying on memory operands that may not be generally possible to do. That is unclear if all targets maintain reasonable memory operands for this to be useful, however it still compares base registers as its first resort.
Targets should maintain reasonable MMOs, and as far as I know, all (in tree) targets now do.
360 ↗	(On Diff #114639)	> D37755 moves AMDGPU implementation into core.
Thanks for posting the follow-up patch, but I don't think that we should do that. Looking at how AArch64 uses the existing implementation, I do think that we should rename the callback to be something specific to store clustering. As far as I can tell from the discussion in D18376, the store clustering, as used by AArch64, is specifically set up to better enable the AArch64LoadStoreOptimizer, and that pass is specifically looking for stores that share the same base register (within some small window, which is why this helps). As a result, I think it would be good to make the callback specific to store clustering.

rampitec marked an inline comment as done. Sep 13 2017, 8:57 AM

rampitec added inline comments.

lib/Target/AMDGPU/SIInstrInfo.cpp
360 ↗	(On Diff #114639)	The current implementation is O(N). I agree with the rest of the comments on the conditions too. It then looks like, of the targets in the tree, only AMDGPU and AArch64 call createLoadClusterDAGMutation and createStoreClusterDAGMutation, and AArch64 needs the current implementation as it uses it for further combining. So I agree, D37755 is premature. I am not sure about renaming, though, as both targets need both load and store clustering. For AMDGPU we definitely want it for both loads and stores.

When seeing the name doMemOpsHaveSameBasePtr I would expect this to do exactly BaseReg1 == BaseReg2. Can you explain how you happen to have cases with different base pointers that still reference the same object?
Given that this callback is very specific to the selectiondag clustering mutator, maybe we better add a callback to createLoadClusterDAGMutation/createStoreClusterDAGMutation rather than in TargetInstructionInfo. That way you could also choose a better name like shouldCluster.
We should indeed not change the behavior for aarch64, as the mutation mainly exists there to enable formation of load/store double instructions which require the same base pointer.
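
The AArch64 requirement mentioned here can be sketched as a strict pairing predicate. The first two checks follow the AArch64 hunk in this revision (same base register, at most one pair). The width and adjacency checks are assumptions of this sketch, and `ToyLdSt` and `shouldClusterForPairing` are hypothetical names rather than LLVM APIs.

```cpp
#include <cstdint>

// Toy description of a load/store; not an LLVM type.
struct ToyLdSt {
  unsigned BaseReg; // base register operand
  int64_t Offset;   // immediate offset in bytes
  unsigned Width;   // access size in bytes
};

// Strict rule for a target that clusters in order to form paired
// load/store instructions later on.
inline bool shouldClusterForPairing(const ToyLdSt &First,
                                    const ToyLdSt &Second,
                                    unsigned NumLoads) {
  // A paired instruction encodes a single base register.
  if (First.BaseReg != Second.BaseReg)
    return false;
  // Only cluster up to a single pair.
  if (NumLoads > 1)
    return false;
  // Assumption: both halves have the same size and are adjacent in memory.
  if (First.Width != Second.Width)
    return false;
  return First.Offset + static_cast<int64_t>(First.Width) == Second.Offset;
}
```

A target with this goal gains nothing from clustering accesses that use different base registers, which is why its behavior is left unchanged.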

In D37698#869839, @MatzeB wrote:

When seeing the name doMemOpsHaveSameBasePtr I would expect this to do exactly BaseReg1 == BaseReg2. Can you explain how you happen to have cases with different base pointers that still reference the same object?

That can only work if instruction has base register and offset operands. AMDGPU flat_load and flat_store instructions do not, their address operand is a single register. If you need to load two values from the same base pointer you do:

flat_load %ptr1
%ptr2 = %ptr1 + offset
flat_load %ptr2

Given that this callback is very specific to the selectiondag clustering mutator, maybe we better add a callback to createLoadClusterDAGMutation/createStoreClusterDAGMutation rather than in TargetInstructionInfo. That way you could also choose a better name like shouldCluster.

I cannot directly inherit LoadClusterMutation or StoreClusterMutation, it is defined in the MachineScheduler.cpp, and then if I move it to common interface I do not want to duplicate the code in both. I can pass a lambda/functor into create(Load|Store)ClusterDAGMutation as a last argument. Is that what you have in mind?
As an alternative I can just drop this portion of code from BaseMemOpClusterMutation::clusterNeighboringMemOps() completely:

if (MemOpRecords[Idx].BaseReg != MemOpRecords[Idx+1].BaseReg) {
  ClusterLength = 1;
  continue;
}

and let TII->shouldClusterMemOps() decide. Maybe it is better. Thoughts?

We should indeed not change the behavior for aarch64, as the mutation mainly exists there to enable formation of load/store double instructions which require the same base pointer.

Agree. I am about to close D37755.

In D37698#869935, @rampitec wrote:

In D37698#869839, @MatzeB wrote:

When seeing the name doMemOpsHaveSameBasePtr I would expect this to do exactly BaseReg1 == BaseReg2. Can you explain how you happen to have cases with different base pointers that still reference the same object?

That can only work if instruction has base register and offset operands. AMDGPU flat_load and flat_store instructions do not, their address operand is a single register. If you need to load two values from the same base pointer you do:

flat_load %ptr1
%ptr2 = %ptr1 + offset
flat_load %ptr2

Given that this callback is very specific to the selectiondag clustering mutator, maybe we better add a callback to createLoadClusterDAGMutation/createStoreClusterDAGMutation rather than in TargetInstructionInfo. That way you could also choose a better name like shouldCluster.

I cannot directly inherit LoadClusterMutation or StoreClusterMutation, it is defined in the MachineScheduler.cpp, and then if I move it to common interface I do not want to duplicate the code in both. I can pass a lambda/functor into create(Load|Store)ClusterDAGMutation as a last argument. Is that what you have in mind?

While you could just as well move the class declaration into a header, an extra function argument is probably easiest.

As an alternative I can just drop this portion of code from BaseMemOpClusterMutation::clusterNeighboringMemOps() completely:
if (MemOpRecords[Idx].BaseReg != MemOpRecords[Idx+1].BaseReg) {
  ClusterLength = 1;
  continue;
}
and let TII->shouldClusterMemOps() decide. Maybe it is better. Thoughts?

Indeed, that sounds like the best/easiest solution so far!

We should indeed not change the behavior for aarch64, as the mutation mainly exists there to enable formation of load/store double instructions which require the same base pointer.

Agree. I am about to close D37755.

In D37698#870072, @MatzeB wrote:
In D37698#869935, @rampitec wrote:
As an alternative I can just drop this portion of code from BaseMemOpClusterMutation::clusterNeighboringMemOps() completely:
if (MemOpRecords[Idx].BaseReg != MemOpRecords[Idx+1].BaseReg) {
  ClusterLength = 1;
  continue;
}
and let TII->shouldClusterMemOps() decide. Maybe it is better. Thoughts?
Indeed, that sounds like the best/easiest solution so far!

I will do so then. Looks cleaner to me.

Changed to let shouldClusterMemOps decide as discussed.
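
The agreed design can be sketched as a toy version of the clustering loop in which the generic code no longer compares base registers and the target-supplied predicate makes the whole decision. This is a simplified model under stated assumptions, not the MachineScheduler implementation: `ToyRecord`, `ToyPredicate`, and `clusterNeighbors` are hypothetical names, and a `std::function` stands in for the virtual `shouldClusterMemOps` hook.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <tuple>
#include <utility>
#include <vector>

// Toy record of one memory operation; not an LLVM type.
struct ToyRecord {
  int Id;           // stands in for the scheduling unit
  unsigned BaseReg; // base register reported by the target
  int64_t Offset;   // immediate offset
};

// Stand-in for the target hook: (first, second, current cluster length).
using ToyPredicate =
    std::function<bool(const ToyRecord &, const ToyRecord &, unsigned)>;

// Sorts the records and visits each adjacent pair once, so the predicate
// is called a linear number of times. Returns the cluster edges as Id pairs.
inline std::vector<std::pair<int, int>>
clusterNeighbors(std::vector<ToyRecord> Records,
                 const ToyPredicate &ShouldCluster) {
  std::vector<std::pair<int, int>> Edges;
  if (Records.size() < 2)
    return Edges;

  std::sort(Records.begin(), Records.end(),
            [](const ToyRecord &L, const ToyRecord &R) {
              return std::tie(L.BaseReg, L.Offset, L.Id) <
                     std::tie(R.BaseReg, R.Offset, R.Id);
            });

  unsigned ClusterLength = 1;
  for (std::size_t Idx = 0; Idx + 1 < Records.size(); ++Idx) {
    // No base register check here: the predicate sees both registers.
    if (ShouldCluster(Records[Idx], Records[Idx + 1], ClusterLength)) {
      Edges.emplace_back(Records[Idx].Id, Records[Idx + 1].Id);
      ++ClusterLength;
    } else {
      ClusterLength = 1;
    }
  }
  return Edges;
}
```

A strict predicate that insists on equal base registers reproduces the old behavior, while a permissive one can keep clustering across different registers up to its own length limit.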

This revision is now accepted and ready to land. Sep 13 2017, 2:39 PM

LGTM

Closed by commit rL313208: Allow target to decide when to cluster loads/stores in misched (authored by rampitec). Sep 13 2017, 3:22 PM

This revision was automatically updated to reflect the committed changes.

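Revision Contents

Path	Size
llvm/trunk/include/llvm/Target/TargetInstrInfo.h	4 lines
llvm/trunk/lib/CodeGen/MachineScheduler.cpp	8 lines
llvm/trunk/lib/Target/AArch64/AArch64InstrInfo.h	3 lines
llvm/trunk/lib/Target/AArch64/AArch64InstrInfo.cpp	5 lines
llvm/trunk/lib/Target/AMDGPU/SIInstrInfo.h	3 lines
llvm/trunk/lib/Target/AMDGPU/SIInstrInfo.cpp	38 lines
llvm/trunk/test/CodeGen/AMDGPU/add.i16.ll	4 lines
llvm/trunk/test/CodeGen/AMDGPU/ctpop.ll	8 lines
llvm/trunk/test/CodeGen/AMDGPU/fadd.f16.ll	6 lines
llvm/trunk/test/CodeGen/AMDGPU/flat-load-clustering.mir	77 lines
llvm/trunk/test/CodeGen/AMDGPU/si-triv-disjoint-mem-access.ll	10 lines
llvm/trunk/test/CodeGen/AMDGPU/sub.i16.ll	4 lines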

Diff 115129

llvm/trunk/include/llvm/Target/TargetInstrInfo.h

@@ public:
   }

   /// Returns true if the two given memory operations should be scheduled
   /// adjacent. Note that you have to add:
   ///   DAG->addMutation(createLoadClusterDAGMutation(DAG->TII, DAG->TRI));
   /// or
   ///   DAG->addMutation(createStoreClusterDAGMutation(DAG->TII, DAG->TRI));
   /// to TargetPassConfig::createMachineScheduler() to have an effect.
-  virtual bool shouldClusterMemOps(MachineInstr &FirstLdSt,
-                                   MachineInstr &SecondLdSt,
+  virtual bool shouldClusterMemOps(MachineInstr &FirstLdSt, unsigned BaseReg1,
+                                   MachineInstr &SecondLdSt, unsigned BaseReg2,
                                    unsigned NumLoads) const {
     llvm_unreachable("target did not implement shouldClusterMemOps()");
   }

   /// Reverses the branch condition of the specified condition list,
   /// returning false on success and true if it cannot be reversed.
   virtual bool
   reverseBranchCondition(SmallVectorImpl<MachineOperand> &Cond) const {

llvm/trunk/lib/CodeGen/MachineScheduler.cpp

@@ if (TII->getMemOpBaseRegImmOfs(*SU->getInstr(), BaseReg, Offset, TRI))
       MemOpRecords.push_back(MemOpInfo(SU, BaseReg, Offset));
   }
   if (MemOpRecords.size() < 2)
     return;

   std::sort(MemOpRecords.begin(), MemOpRecords.end());
   unsigned ClusterLength = 1;
   for (unsigned Idx = 0, End = MemOpRecords.size(); Idx < (End - 1); ++Idx) {
-    if (MemOpRecords[Idx].BaseReg != MemOpRecords[Idx+1].BaseReg) {
-      ClusterLength = 1;
-      continue;
-    }
-
     SUnit *SUa = MemOpRecords[Idx].SU;
     SUnit *SUb = MemOpRecords[Idx+1].SU;
-    if (TII->shouldClusterMemOps(*SUa->getInstr(), *SUb->getInstr(),
+    if (TII->shouldClusterMemOps(*SUa->getInstr(), MemOpRecords[Idx].BaseReg,
+                                 *SUb->getInstr(), MemOpRecords[Idx+1].BaseReg,
                                  ClusterLength) &&
         DAG->addEdge(SUb, SDep(SUa, SDep::Cluster))) {
       DEBUG(dbgs() << "Cluster ld/st SU(" << SUa->NodeNum << ") - SU("
                    << SUb->NodeNum << ")\n");
       // Copy successor edges from SUa to SUb. Interleaving computation
       // dependent on SUa can prevent load combining due to register reuse.
       // Predecessor edges do not need to be copied from SUb to SUa since nearby
       // loads should have effectively the same inputs.

llvm/trunk/lib/Target/AArch64/AArch64InstrInfo.h

@@ public:

   /// \brief Returns true if opcode \p Opc is a memory operation. If it is, set
   /// \p Scale, \p Width, \p MinOffset, and \p MaxOffset accordingly.
   ///
   /// For unscaled instructions, \p Scale is set to 1.
   bool getMemOpInfo(unsigned Opcode, unsigned &Scale, unsigned &Width,
                     int64_t &MinOffset, int64_t &MaxOffset) const;

-  bool shouldClusterMemOps(MachineInstr &FirstLdSt, MachineInstr &SecondLdSt,
+  bool shouldClusterMemOps(MachineInstr &FirstLdSt, unsigned BaseReg1,
+                           MachineInstr &SecondLdSt, unsigned BaseReg2,
                            unsigned NumLoads) const override;

   void copyPhysRegTuple(MachineBasicBlock &MBB, MachineBasicBlock::iterator I,
                         const DebugLoc &DL, unsigned DestReg, unsigned SrcReg,
                         bool KillSrc, unsigned Opcode,
                         llvm::ArrayRef<unsigned> Indices) const;
   void copyPhysReg(MachineBasicBlock &MBB, MachineBasicBlock::iterator I,
                    const DebugLoc &DL, unsigned DestReg, unsigned SrcReg,

llvm/trunk/lib/Target/AArch64/AArch64InstrInfo.cpp

@@ static bool canPairLdStOpc(unsigned FirstOpc, unsigned SecondOpc) {
   // These instructions can't be paired based on their opcodes.
   return false;
 }

 /// Detect opportunities for ldp/stp formation.
 ///
 /// Only called for LdSt for which getMemOpBaseRegImmOfs returns true.
 bool AArch64InstrInfo::shouldClusterMemOps(MachineInstr &FirstLdSt,
+                                           unsigned BaseReg1,
                                            MachineInstr &SecondLdSt,
+                                           unsigned BaseReg2,
                                            unsigned NumLoads) const {
+  if (BaseReg1 != BaseReg2)
+    return false;
+
   // Only cluster up to a single pair.
   if (NumLoads > 1)
     return false;

   if (!isPairableLdStInst(FirstLdSt) || !isPairableLdStInst(SecondLdSt))
     return false;

   // Can we pair these instructions based on their opcodes?

llvm/trunk/lib/Target/AMDGPU/SIInstrInfo.h

@@ public:
   bool areLoadsFromSameBasePtr(SDNode *Load1, SDNode *Load2,
                                int64_t &Offset1,
                                int64_t &Offset2) const override;

   bool getMemOpBaseRegImmOfs(MachineInstr &LdSt, unsigned &BaseReg,
                              int64_t &Offset,
                              const TargetRegisterInfo *TRI) const final;

-  bool shouldClusterMemOps(MachineInstr &FirstLdSt, MachineInstr &SecondLdSt,
+  bool shouldClusterMemOps(MachineInstr &FirstLdSt, unsigned BaseReg1,
+                           MachineInstr &SecondLdSt, unsigned BaseReg2,
                            unsigned NumLoads) const final;

   void copyPhysReg(MachineBasicBlock &MBB, MachineBasicBlock::iterator MI,
                    const DebugLoc &DL, unsigned DestReg, unsigned SrcReg,
                    bool KillSrc) const override;

   unsigned calculateLDSSpillAddress(MachineBasicBlock &MBB, MachineInstr &MI,
                                     RegScavenger *RS, unsigned TmpReg,

llvm/trunk/lib/Target/AMDGPU/SIInstrInfo.cpp

 #include "Utils/AMDGPUBaseInfo.h"
 #include "llvm/ADT/APInt.h"
 #include "llvm/ADT/ArrayRef.h"
 #include "llvm/ADT/SmallVector.h"
 #include "llvm/ADT/StringRef.h"
 #include "llvm/ADT/iterator_range.h"
 #include "llvm/Analysis/AliasAnalysis.h"
 #include "llvm/Analysis/MemoryLocation.h"
+#include "llvm/Analysis/ValueTracking.h"
 #include "llvm/CodeGen/MachineBasicBlock.h"
 #include "llvm/CodeGen/MachineFrameInfo.h"
 #include "llvm/CodeGen/MachineFunction.h"
 #include "llvm/CodeGen/MachineInstr.h"
 #include "llvm/CodeGen/MachineInstrBuilder.h"
 #include "llvm/CodeGen/MachineInstrBundle.h"
 #include "llvm/CodeGen/MachineMemOperand.h"
 #include "llvm/CodeGen/MachineOperand.h"
@@ if (isFLAT(LdSt)) {

     Offset = getNamedOperand(LdSt, AMDGPU::OpName::offset)->getImm();
     return true;
   }

   return false;
 }

+static bool memOpsHaveSameBasePtr(const MachineInstr &MI1, unsigned BaseReg1,
+                                  const MachineInstr &MI2, unsigned BaseReg2) {
+  if (BaseReg1 == BaseReg2)
+    return true;
+
+  if (!MI1.hasOneMemOperand() || !MI2.hasOneMemOperand())
+    return false;
+
+  auto MO1 = *MI1.memoperands_begin();
+  auto MO2 = *MI2.memoperands_begin();
+  if (MO1->getAddrSpace() != MO2->getAddrSpace())
+    return false;
+
+  auto Base1 = MO1->getValue();
+  auto Base2 = MO2->getValue();
+  if (!Base1 || !Base2)
+    return false;
+  const MachineFunction &MF = *MI1.getParent()->getParent();
+  const DataLayout &DL = MF.getFunction()->getParent()->getDataLayout();
+  Base1 = GetUnderlyingObject(Base1, DL);
+  Base2 = GetUnderlyingObject(Base2, DL);
+
+  if (isa<UndefValue>(Base1) || isa<UndefValue>(Base2))
+    return false;
+
+  return Base1 == Base2;
+}
+
 bool SIInstrInfo::shouldClusterMemOps(MachineInstr &FirstLdSt,
+                                      unsigned BaseReg1,
                                       MachineInstr &SecondLdSt,
+                                      unsigned BaseReg2,
                                       unsigned NumLoads) const {
+  if (!memOpsHaveSameBasePtr(FirstLdSt, BaseReg1, SecondLdSt, BaseReg2))
+    return false;
+
   const MachineOperand *FirstDst = nullptr;
   const MachineOperand *SecondDst = nullptr;

   if ((isMUBUF(FirstLdSt) && isMUBUF(SecondLdSt)) ||
       (isMTBUF(FirstLdSt) && isMTBUF(SecondLdSt)) ||
       (isFLAT(FirstLdSt) && isFLAT(SecondLdSt))) {
+    const unsigned MaxGlobalLoadCluster = 6;
+    if (NumLoads > MaxGlobalLoadCluster)
+      return false;
+
     FirstDst = getNamedOperand(FirstLdSt, AMDGPU::OpName::vdata);
     if (!FirstDst)
       FirstDst = getNamedOperand(FirstLdSt, AMDGPU::OpName::vdst);
     SecondDst = getNamedOperand(SecondLdSt, AMDGPU::OpName::vdata);
     if (!SecondDst)
       SecondDst = getNamedOperand(SecondLdSt, AMDGPU::OpName::vdst);
   } else if (isSMRD(FirstLdSt) && isSMRD(SecondLdSt)) {
     FirstDst = getNamedOperand(FirstLdSt, AMDGPU::OpName::sdst);

llvm/trunk/test/CodeGen/AMDGPU/add.i16.ll

@@ define amdgpu_kernel void @v_test_add_i16_zext_to_i64(i64 addrspace(1)* %out, i16 addrspace(1)* %in0, i16 addrspace(1)* %in1) #1 {
   store i64 %ext, i64 addrspace(1)* %out
   ret void
 }

 ; FIXME: Need to handle non-uniform case for function below (load without gep).
 ; GCN-LABEL: {{^}}v_test_add_i16_sext_to_i32:
 ; VI: flat_load_ushort [[A:v[0-9]+]]
 ; VI: flat_load_ushort [[B:v[0-9]+]]
-; VI: v_add_u16_e32 [[ADD:v[0-9]+]], [[A]], [[B]]
+; VI: v_add_u16_e32 [[ADD:v[0-9]+]], [[B]], [[A]]
 ; VI-NEXT: v_bfe_i32 [[SEXT:v[0-9]+]], [[ADD]], 0, 16
 ; VI-NEXT: buffer_store_dword [[SEXT]]
 define amdgpu_kernel void @v_test_add_i16_sext_to_i32(i32 addrspace(1)* %out, i16 addrspace(1)* %in0, i16 addrspace(1)* %in1) #1 {
   %tid = call i32 @llvm.amdgcn.workitem.id.x()
   %gep.out = getelementptr inbounds i32, i32 addrspace(1)* %out, i32 %tid
   %gep.in0 = getelementptr inbounds i16, i16 addrspace(1)* %in0, i32 %tid
   %gep.in1 = getelementptr inbounds i16, i16 addrspace(1)* %in1, i32 %tid
   %a = load i16, i16 addrspace(1)* %gep.in0
   %b = load i16, i16 addrspace(1)* %gep.in1
   %add = add i16 %a, %b
   %ext = sext i16 %add to i32
   store i32 %ext, i32 addrspace(1)* %out
   ret void
 }

 ; FIXME: Need to handle non-uniform case for function below (load without gep).
 ; GCN-LABEL: {{^}}v_test_add_i16_sext_to_i64:
 ; VI: flat_load_ushort [[A:v[0-9]+]]
 ; VI: flat_load_ushort [[B:v[0-9]+]]
-; VI: v_add_u16_e32 [[ADD:v[0-9]+]], [[A]], [[B]]
+; VI: v_add_u16_e32 [[ADD:v[0-9]+]], [[B]], [[A]]
 ; VI-NEXT: v_bfe_i32 v[[LO:[0-9]+]], [[ADD]], 0, 16
 ; VI-NEXT: v_ashrrev_i32_e32 v[[HI:[0-9]+]], 31, v[[LO]]
 ; VI-NEXT: buffer_store_dwordx2 v{{\[}}[[LO]]:[[HI]]{{\]}}
 define amdgpu_kernel void @v_test_add_i16_sext_to_i64(i64 addrspace(1)* %out, i16 addrspace(1)* %in0, i16 addrspace(1)* %in1) #1 {
   %tid = call i32 @llvm.amdgcn.workitem.id.x()
   %gep.out = getelementptr inbounds i64, i64 addrspace(1)* %out, i32 %tid
   %gep.in0 = getelementptr inbounds i16, i16 addrspace(1)* %in0, i32 %tid
   %gep.in1 = getelementptr inbounds i16, i16 addrspace(1)* %in1, i32 %tid

llvm/trunk/test/CodeGen/AMDGPU/ctpop.ll

Show All 36 Lines	define amdgpu_kernel void @v_ctpop_i32(i32 addrspace(1)* noalias %out, i32 addrspace(1)* noalias %in) nounwind {
%in.gep = getelementptr i32, i32 addrspace(1)* %in, i32 %tid		%in.gep = getelementptr i32, i32 addrspace(1)* %in, i32 %tid
%val = load i32, i32 addrspace(1)* %in.gep, align 4		%val = load i32, i32 addrspace(1)* %in.gep, align 4
%ctpop = call i32 @llvm.ctpop.i32(i32 %val) nounwind readnone		%ctpop = call i32 @llvm.ctpop.i32(i32 %val) nounwind readnone
store i32 %ctpop, i32 addrspace(1)* %out, align 4		store i32 %ctpop, i32 addrspace(1)* %out, align 4
ret void		ret void
}		}

; FUNC-LABEL: {{^}}v_ctpop_add_chain_i32:		; FUNC-LABEL: {{^}}v_ctpop_add_chain_i32:
; GCN: {{buffer\|flat}}_load_dword [[VAL0:v[0-9]+]],		; SI: buffer_load_dword [[VAL0:v[0-9]+]],
; GCN: {{buffer\|flat}}_load_dword [[VAL1:v[0-9]+]],		; SI: buffer_load_dword [[VAL1:v[0-9]+]],
		; VI: flat_load_dword [[VAL1:v[0-9]+]],
		; VI: flat_load_dword [[VAL0:v[0-9]+]],
; GCN: v_bcnt_u32_b32{{(_e64)*}} [[MIDRESULT:v[0-9]+]], [[VAL1]], 0		; GCN: v_bcnt_u32_b32{{(_e64)*}} [[MIDRESULT:v[0-9]+]], [[VAL1]], 0
; SI: v_bcnt_u32_b32_e32 [[RESULT:v[0-9]+]], [[VAL0]], [[MIDRESULT]]		; SI: v_bcnt_u32_b32_e32 [[RESULT:v[0-9]+]], [[VAL0]], [[MIDRESULT]]
; VI: v_bcnt_u32_b32 [[RESULT:v[0-9]+]], [[VAL0]], [[MIDRESULT]]		; VI: v_bcnt_u32_b32 [[RESULT:v[0-9]+]], [[VAL0]], [[MIDRESULT]]
; GCN: buffer_store_dword [[RESULT]],		; GCN: buffer_store_dword [[RESULT]],
; GCN: s_endpgm		; GCN: s_endpgm

; EG: BCNT_INT		; EG: BCNT_INT
; EG: BCNT_INT		; EG: BCNT_INT
define amdgpu_kernel void @v_ctpop_i32_add_var_inv(i32 addrspace(1)* noalias %out, i32 addrspace(1)* noalias %in, i32 %const) nounwind {
store i32 %add, i32 addrspace(1)* %out, align 4		store i32 %add, i32 addrspace(1)* %out, align 4
ret void		ret void
}		}

; FUNC-LABEL: {{^}}v_ctpop_i32_add_vvar_inv:		; FUNC-LABEL: {{^}}v_ctpop_i32_add_vvar_inv:
; SI: buffer_load_dword [[VAR:v[0-9]+]], v[{{[0-9]+:[0-9]+}}], s[{{[0-9]+:[0-9]+}}], 0 addr64		; SI: buffer_load_dword [[VAR:v[0-9]+]], v[{{[0-9]+:[0-9]+}}], s[{{[0-9]+:[0-9]+}}], 0 addr64
; SI: buffer_load_dword [[VAL:v[0-9]+]], v[{{[0-9]+:[0-9]+}}], s[{{[0-9]+:[0-9]+}}], 0 addr64		; SI: buffer_load_dword [[VAL:v[0-9]+]], v[{{[0-9]+:[0-9]+}}], s[{{[0-9]+:[0-9]+}}], 0 addr64
; SI: v_bcnt_u32_b32_e32 [[RESULT:v[0-9]+]], [[VAR]], [[VAL]]		; SI: v_bcnt_u32_b32_e32 [[RESULT:v[0-9]+]], [[VAR]], [[VAL]]
; VI: flat_load_dword [[VAL:v[0-9]+]], v[{{[0-9]+:[0-9]+}}]
; VI: flat_load_dword [[VAR:v[0-9]+]], v[{{[0-9]+:[0-9]+}}]		; VI: flat_load_dword [[VAR:v[0-9]+]], v[{{[0-9]+:[0-9]+}}]
		; VI: flat_load_dword [[VAL:v[0-9]+]], v[{{[0-9]+:[0-9]+}}]
; VI: v_bcnt_u32_b32 [[RESULT:v[0-9]+]], [[VAL]], [[VAR]]		; VI: v_bcnt_u32_b32 [[RESULT:v[0-9]+]], [[VAL]], [[VAR]]
; GCN: buffer_store_dword [[RESULT]],		; GCN: buffer_store_dword [[RESULT]],
; GCN: s_endpgm		; GCN: s_endpgm

; EG: BCNT_INT		; EG: BCNT_INT
define amdgpu_kernel void @v_ctpop_i32_add_vvar_inv(i32 addrspace(1)* noalias %out, i32 addrspace(1)* noalias %in, i32 addrspace(1)* noalias %constptr) nounwind {		define amdgpu_kernel void @v_ctpop_i32_add_vvar_inv(i32 addrspace(1)* noalias %out, i32 addrspace(1)* noalias %in, i32 addrspace(1)* noalias %constptr) nounwind {
%tid = call i32 @llvm.r600.read.tidig.x()		%tid = call i32 @llvm.r600.read.tidig.x()
%in.gep = getelementptr i32, i32 addrspace(1)* %in, i32 %tid		%in.gep = getelementptr i32, i32 addrspace(1)* %in, i32 %tid

llvm/trunk/test/CodeGen/AMDGPU/fadd.f16.ll

	entry:			entry:
	%a.val = load half, half addrspace(1)* %a			%a.val = load half, half addrspace(1)* %a
	%r.val = fadd half %a.val, 2.0			%r.val = fadd half %a.val, 2.0
	store half %r.val, half addrspace(1)* %r			store half %r.val, half addrspace(1)* %r
	ret void			ret void
	}			}

	; GCN-LABEL: {{^}}fadd_v2f16:			; GCN-LABEL: {{^}}fadd_v2f16:
	; GCN: {{buffer|flat}}_load_dword v[[A_V2_F16:[0-9]+]]			; SI: buffer_load_dword v[[A_V2_F16:[0-9]+]]
	; GCN: {{buffer|flat}}_load_dword v[[B_V2_F16:[0-9]+]]			; SI: buffer_load_dword v[[B_V2_F16:[0-9]+]]
				; VI: flat_load_dword v[[B_V2_F16:[0-9]+]]
				; VI: flat_load_dword v[[A_V2_F16:[0-9]+]]

	; SI: v_cvt_f32_f16_e32 v[[A_F32_0:[0-9]+]], v[[A_V2_F16]]			; SI: v_cvt_f32_f16_e32 v[[A_F32_0:[0-9]+]], v[[A_V2_F16]]
	; SI: v_lshrrev_b32_e32 v[[A_F16_1:[0-9]+]], 16, v[[A_V2_F16]]			; SI: v_lshrrev_b32_e32 v[[A_F16_1:[0-9]+]], 16, v[[A_V2_F16]]
	; SI: v_cvt_f32_f16_e32 v[[B_F32_0:[0-9]+]], v[[B_V2_F16]]			; SI: v_cvt_f32_f16_e32 v[[B_F32_0:[0-9]+]], v[[B_V2_F16]]
	; SI: v_lshrrev_b32_e32 v[[B_F16_1:[0-9]+]], 16, v[[B_V2_F16]]			; SI: v_lshrrev_b32_e32 v[[B_F16_1:[0-9]+]], 16, v[[B_V2_F16]]

	; SI-DAG: v_cvt_f32_f16_e32 v[[A_F32_1:[0-9]+]], v[[A_F16_1]]			; SI-DAG: v_cvt_f32_f16_e32 v[[A_F32_1:[0-9]+]], v[[A_F16_1]]
	; SI-DAG: v_cvt_f32_f16_e32 v[[B_F32_1:[0-9]+]], v[[B_F16_1]]			; SI-DAG: v_cvt_f32_f16_e32 v[[B_F32_1:[0-9]+]], v[[B_F16_1]]

llvm/trunk/test/CodeGen/AMDGPU/flat-load-clustering.mir

				# RUN: llc -march=amdgcn -mcpu=tonga -verify-machineinstrs -run-pass machine-scheduler -o - %s | FileCheck -check-prefix=GCN %s

				# GCN-LABEL: name: flat_load_clustering
				# GCN: FLAT_LOAD_DWORD
				# GCN-NEXT: FLAT_LOAD_DWORD
				--- |
				define amdgpu_kernel void @flat_load_clustering(i32 addrspace(1)* nocapture %arg, i32 addrspace(2)* nocapture readonly %arg1) {
				bb:
				%tid = tail call i32 @llvm.amdgcn.workitem.id.x()
				%idxprom = sext i32 %tid to i64
				%gep1 = getelementptr inbounds i32, i32 addrspace(2)* %arg1, i64 %idxprom
				%load1 = load i32, i32 addrspace(2)* %gep1, align 4
				%gep2 = getelementptr inbounds i32, i32 addrspace(1)* %arg, i64 %idxprom
				%gep34 = getelementptr inbounds i32, i32 addrspace(2)* %gep1, i64 4
				%load2 = load i32, i32 addrspace(2)* %gep34, align 4
				%gep4 = getelementptr inbounds i32, i32 addrspace(1)* %gep2, i64 4
				store i32 %load1, i32 addrspace(1)* %gep2, align 4
				store i32 %load2, i32 addrspace(1)* %gep4, align 4
				ret void
				}

				declare i32 @llvm.amdgcn.workitem.id.x()

				...
				---
				name: flat_load_clustering
				alignment: 0
				exposesReturnsTwice: false
				legalized: false
				regBankSelected: false
				selected: false
				tracksRegLiveness: true
				registers:
				- { id: 0, class: vgpr_32 }
				- { id: 1, class: sgpr_64 }
				- { id: 2, class: vgpr_32 }
				- { id: 3, class: sreg_64_xexec }
				- { id: 4, class: sreg_64_xexec }
				- { id: 5, class: vgpr_32 }
				- { id: 6, class: vgpr_32 }
				- { id: 7, class: vgpr_32 }
				- { id: 8, class: vgpr_32 }
				- { id: 9, class: vreg_64 }
				- { id: 10, class: vreg_64 }
				- { id: 11, class: vgpr_32 }
				- { id: 12, class: vreg_64 }
				- { id: 13, class: vreg_64 }
				liveins:
				- { reg: '%vgpr0', virtual-reg: '%0' }
				- { reg: '%sgpr4_sgpr5', virtual-reg: '%1' }
				body: |
				bb.0.bb:
				liveins: %vgpr0, %sgpr4_sgpr5

				%1 = COPY %sgpr4_sgpr5
				%0 = COPY %vgpr0
				%3 = S_LOAD_DWORDX2_IMM %1, 0, 0 :: (non-temporal dereferenceable invariant load 8 from `i64 addrspace(2)* undef`)
				%4 = S_LOAD_DWORDX2_IMM %1, 8, 0 :: (non-temporal dereferenceable invariant load 8 from `i64 addrspace(2)* undef`)
				%7 = V_LSHLREV_B32_e32 2, %0, implicit %exec
				%2 = V_MOV_B32_e32 0, implicit %exec
				undef %12.sub0 = V_ADD_I32_e32 %4.sub0, %7, implicit-def %vcc, implicit %exec
				%11 = COPY %4.sub1
				%12.sub1 = V_ADDC_U32_e32 %11, %2, implicit-def dead %vcc, implicit killed %vcc, implicit %exec
				%5 = FLAT_LOAD_DWORD %12, 0, 0, 0, implicit %exec, implicit %flat_scr :: (load 4 from %ir.gep1)
				undef %9.sub0 = V_ADD_I32_e32 %3.sub0, %7, implicit-def %vcc, implicit %exec
				%8 = COPY %3.sub1
				%9.sub1 = V_ADDC_U32_e32 %8, %2, implicit-def dead %vcc, implicit killed %vcc, implicit %exec
				undef %13.sub0 = V_ADD_I32_e32 16, %12.sub0, implicit-def %vcc, implicit %exec
				%13.sub1 = V_ADDC_U32_e32 %12.sub1, %2, implicit-def dead %vcc, implicit killed %vcc, implicit %exec
				%6 = FLAT_LOAD_DWORD %13, 0, 0, 0, implicit %exec, implicit %flat_scr :: (load 4 from %ir.gep34)
				undef %10.sub0 = V_ADD_I32_e32 16, %9.sub0, implicit-def %vcc, implicit %exec
				%10.sub1 = V_ADDC_U32_e32 %9.sub1, %2, implicit-def dead %vcc, implicit killed %vcc, implicit %exec
				FLAT_STORE_DWORD %9, %5, 0, 0, 0, implicit %exec, implicit %flat_scr :: (store 4 into %ir.gep2)
				FLAT_STORE_DWORD %10, %6, 0, 0, 0, implicit %exec, implicit %flat_scr :: (store 4 into %ir.gep4)
				S_ENDPGM

				...
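
The MIR test above covers the case described in the revision summary: the two FLAT_LOAD_DWORD instructions use different base registers (%12 and %13), yet both accesses (%ir.gep1 and %ir.gep34) derive from the same IR object, %arg1. A minimal sketch of that decision follows, assuming a toy model rather than LLVM's real classes. `ToyMemOp` and this free-function form of `memOpsHaveSameBasePtr` are hypothetical stand-ins for illustration, not the code from this revision.

```cpp
#include <string>

// Hypothetical stand-in for a memory instruction; not an LLVM type.
struct ToyMemOp {
  unsigned BaseReg;             // virtual register holding the full address
  std::string UnderlyingObject; // IR object the access derives from; empty if unknown
};

// Toy version of the target hook: two memory operations share a base if they
// use the same base register, or if both have a known underlying object and
// those objects are identical.
bool memOpsHaveSameBasePtr(const ToyMemOp &A, const ToyMemOp &B) {
  if (A.BaseReg == B.BaseReg)
    return true;
  if (A.UnderlyingObject.empty() || B.UnderlyingObject.empty())
    return false; // unknown provenance: be conservative, do not cluster
  return A.UnderlyingObject == B.UnderlyingObject;
}
```

In this model the two loads in the test would be `ToyMemOp{12, "arg1"}` and `ToyMemOp{13, "arg1"}`: a comparison of base registers alone rejects them, while the underlying-object comparison accepts them.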

llvm/trunk/test/CodeGen/AMDGPU/si-triv-disjoint-mem-access.ll

	; CI-DAG: buffer_load_dword {{v[0-9]+}}, off, {{s\[[0-9]+:[0-9]+\]}}, 0 offset:400			; CI-DAG: buffer_load_dword {{v[0-9]+}}, off, {{s\[[0-9]+:[0-9]+\]}}, 0 offset:400
	; CI-DAG: buffer_load_dword {{v[0-9]+}}, off, {{s\[[0-9]+:[0-9]+\]}}, 0 offset:408			; CI-DAG: buffer_load_dword {{v[0-9]+}}, off, {{s\[[0-9]+:[0-9]+\]}}, 0 offset:408
	; CI-DAG: buffer_store_dword {{v[0-9]+}}, off, {{s\[[0-9]+:[0-9]+\]}}, 0 offset:12			; CI-DAG: buffer_store_dword {{v[0-9]+}}, off, {{s\[[0-9]+:[0-9]+\]}}, 0 offset:12
	; CI-DAG: buffer_store_dword {{v[0-9]+}}, off, {{s\[[0-9]+:[0-9]+\]}}, 0 offset:400			; CI-DAG: buffer_store_dword {{v[0-9]+}}, off, {{s\[[0-9]+:[0-9]+\]}}, 0 offset:400
	; CI-DAG: buffer_store_dword {{v[0-9]+}}, off, {{s\[[0-9]+:[0-9]+\]}}, 0 offset:408			; CI-DAG: buffer_store_dword {{v[0-9]+}}, off, {{s\[[0-9]+:[0-9]+\]}}, 0 offset:408
	; CI: buffer_store_dword			; CI: buffer_store_dword
	; CI: s_endpgm			; CI: s_endpgm

	; GFX9: global_load_dword {{v[0-9]+}}, v{{\[[0-9]+:[0-9]+\]}}, off offset:400			; GFX9-DAG: global_load_dword {{v[0-9]+}}, v{{\[[0-9]+:[0-9]+\]}}, off offset:400
	; GFX9: global_load_dword {{v[0-9]+}}, v{{\[[0-9]+:[0-9]+\]}}, off offset:408			; GFX9-DAG: global_load_dword {{v[0-9]+}}, v{{\[[0-9]+:[0-9]+\]}}, off offset:408
	; GFX9: global_store_dword v{{\[[0-9]+:[0-9]+\]}}, {{v[0-9]+}}, off offset:12			; GFX9-DAG: global_store_dword v{{\[[0-9]+:[0-9]+\]}}, {{v[0-9]+}}, off offset:12
	; GFX9: global_store_dword v{{\[[0-9]+:[0-9]+\]}}, {{v[0-9]+}}, off offset:400			; GFX9-DAG: global_store_dword v{{\[[0-9]+:[0-9]+\]}}, {{v[0-9]+}}, off offset:400
	; GFX9: global_store_dword v{{\[[0-9]+:[0-9]+\]}}, {{v[0-9]+}}, off offset:408			; GFX9-DAG: global_store_dword v{{\[[0-9]+:[0-9]+\]}}, {{v[0-9]+}}, off offset:408
	; GFX9: global_store_dword			; GFX9: global_store_dword
	; GFX9: s_endpgm			; GFX9: s_endpgm
	define amdgpu_kernel void @reorder_global_offsets(i32 addrspace(1)* nocapture %out, i32 addrspace(1)* noalias nocapture readnone %gptr, i32 addrspace(1)* noalias nocapture %ptr0) #0 {			define amdgpu_kernel void @reorder_global_offsets(i32 addrspace(1)* nocapture %out, i32 addrspace(1)* noalias nocapture readnone %gptr, i32 addrspace(1)* noalias nocapture %ptr0) #0 {
	%ptr1 = getelementptr inbounds i32, i32 addrspace(1)* %ptr0, i32 3			%ptr1 = getelementptr inbounds i32, i32 addrspace(1)* %ptr0, i32 3
	%ptr2 = getelementptr inbounds i32, i32 addrspace(1)* %ptr0, i32 100			%ptr2 = getelementptr inbounds i32, i32 addrspace(1)* %ptr0, i32 100
	%ptr3 = getelementptr inbounds i32, i32 addrspace(1)* %ptr0, i32 102			%ptr3 = getelementptr inbounds i32, i32 addrspace(1)* %ptr0, i32 102

	store i32 123, i32 addrspace(1)* %ptr1, align 4			store i32 123, i32 addrspace(1)* %ptr1, align 4

llvm/trunk/test/CodeGen/AMDGPU/sub.i16.ll

define amdgpu_kernel void @v_test_sub_i16_zext_to_i64(i64 addrspace(1)* %out, i16 addrspace(1)* %in0, i16 addrspace(1)* %in1) #1 {
store i64 %ext, i64 addrspace(1)* %out		store i64 %ext, i64 addrspace(1)* %out
ret void		ret void
}		}

; FIXME: Need to handle non-uniform case for function below (load without gep).		; FIXME: Need to handle non-uniform case for function below (load without gep).
; GCN-LABEL: {{^}}v_test_sub_i16_sext_to_i32:		; GCN-LABEL: {{^}}v_test_sub_i16_sext_to_i32:
; VI: flat_load_ushort [[A:v[0-9]+]]		; VI: flat_load_ushort [[A:v[0-9]+]]
; VI: flat_load_ushort [[B:v[0-9]+]]		; VI: flat_load_ushort [[B:v[0-9]+]]
; VI: v_sub_u16_e32 [[ADD:v[0-9]+]], [[A]], [[B]]		; VI: v_sub_u16_e32 [[ADD:v[0-9]+]], [[B]], [[A]]
; VI-NEXT: v_bfe_i32 [[SEXT:v[0-9]+]], [[ADD]], 0, 16		; VI-NEXT: v_bfe_i32 [[SEXT:v[0-9]+]], [[ADD]], 0, 16
; VI-NEXT: buffer_store_dword [[SEXT]]		; VI-NEXT: buffer_store_dword [[SEXT]]
define amdgpu_kernel void @v_test_sub_i16_sext_to_i32(i32 addrspace(1)* %out, i16 addrspace(1)* %in0, i16 addrspace(1)* %in1) #1 {		define amdgpu_kernel void @v_test_sub_i16_sext_to_i32(i32 addrspace(1)* %out, i16 addrspace(1)* %in0, i16 addrspace(1)* %in1) #1 {
%tid = call i32 @llvm.amdgcn.workitem.id.x()		%tid = call i32 @llvm.amdgcn.workitem.id.x()
%gep.out = getelementptr inbounds i32, i32 addrspace(1)* %out, i32 %tid		%gep.out = getelementptr inbounds i32, i32 addrspace(1)* %out, i32 %tid
%gep.in0 = getelementptr inbounds i16, i16 addrspace(1)* %in0, i32 %tid		%gep.in0 = getelementptr inbounds i16, i16 addrspace(1)* %in0, i32 %tid
%gep.in1 = getelementptr inbounds i16, i16 addrspace(1)* %in1, i32 %tid		%gep.in1 = getelementptr inbounds i16, i16 addrspace(1)* %in1, i32 %tid
%a = load i16, i16 addrspace(1)* %gep.in0		%a = load i16, i16 addrspace(1)* %gep.in0
%b = load i16, i16 addrspace(1)* %gep.in1		%b = load i16, i16 addrspace(1)* %gep.in1
%add = sub i16 %a, %b		%add = sub i16 %a, %b
%ext = sext i16 %add to i32		%ext = sext i16 %add to i32
store i32 %ext, i32 addrspace(1)* %out		store i32 %ext, i32 addrspace(1)* %out
ret void		ret void
}		}

; FIXME: Need to handle non-uniform case for function below (load without gep).		; FIXME: Need to handle non-uniform case for function below (load without gep).
; GCN-LABEL: {{^}}v_test_sub_i16_sext_to_i64:		; GCN-LABEL: {{^}}v_test_sub_i16_sext_to_i64:
; VI: flat_load_ushort [[A:v[0-9]+]]		; VI: flat_load_ushort [[A:v[0-9]+]]
; VI: flat_load_ushort [[B:v[0-9]+]]		; VI: flat_load_ushort [[B:v[0-9]+]]
; VI: v_sub_u16_e32 [[ADD:v[0-9]+]], [[A]], [[B]]		; VI: v_sub_u16_e32 [[ADD:v[0-9]+]], [[B]], [[A]]
; VI-NEXT: v_bfe_i32 v[[LO:[0-9]+]], [[ADD]], 0, 16		; VI-NEXT: v_bfe_i32 v[[LO:[0-9]+]], [[ADD]], 0, 16
; VI-NEXT: v_ashrrev_i32_e32 v[[HI:[0-9]+]], 31, v[[LO]]		; VI-NEXT: v_ashrrev_i32_e32 v[[HI:[0-9]+]], 31, v[[LO]]
; VI-NEXT: buffer_store_dwordx2 v{{\[}}[[LO]]:[[HI]]{{\]}}		; VI-NEXT: buffer_store_dwordx2 v{{\[}}[[LO]]:[[HI]]{{\]}}
define amdgpu_kernel void @v_test_sub_i16_sext_to_i64(i64 addrspace(1)* %out, i16 addrspace(1)* %in0, i16 addrspace(1)* %in1) #1 {		define amdgpu_kernel void @v_test_sub_i16_sext_to_i64(i64 addrspace(1)* %out, i16 addrspace(1)* %in0, i16 addrspace(1)* %in1) #1 {
%tid = call i32 @llvm.amdgcn.workitem.id.x()		%tid = call i32 @llvm.amdgcn.workitem.id.x()
%gep.out = getelementptr inbounds i64, i64 addrspace(1)* %out, i32 %tid		%gep.out = getelementptr inbounds i64, i64 addrspace(1)* %out, i32 %tid
%gep.in0 = getelementptr inbounds i16, i16 addrspace(1)* %in0, i32 %tid		%gep.in0 = getelementptr inbounds i16, i16 addrspace(1)* %in0, i32 %tid
%gep.in1 = getelementptr inbounds i16, i16 addrspace(1)* %in1, i32 %tid		%gep.in1 = getelementptr inbounds i16, i16 addrspace(1)* %in1, i32 %tid

Diff 115129