This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Correct NumLoads in clustering
Closed · Public

Authored by rampitec on Jan 23 2020, 1:27 PM.

Details

Reviewers

foad
dstuttard
kerbowa
vpykhtin
t.p.northover

Commits

rGbe8e38cbd978: Correct NumLoads in clustering

Summary

The scheduler passes the NumLoads argument to shouldClusterMemOps()
as one less than the actual cluster length. So for 2 instructions
it will pass just 1. Correct this number.

This is NFC for in-tree targets.
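
The off-by-one described in the summary can be illustrated with a toy model of the clustering loop. This is only a sketch, not the LLVM code: `RecordingHook`, `numLoadsSeen` and the `Extra` parameter are hypothetical names introduced for illustration, and the hook always agrees to cluster so that the values it receives are easy to read off.

```cpp
#include <cassert>
#include <vector>

// Toy stand-in for a target hook (hypothetical, not TargetInstrInfo). It
// records every NumLoads value it is asked about and always agrees to cluster.
struct RecordingHook {
  std::vector<unsigned> Seen;
  bool shouldCluster(unsigned NumLoads) {
    Seen.push_back(NumLoads);
    return true;
  }
};

// Walks NumOps neighbouring memory operations pairwise. ClusterLength counts
// the operations already in the current cluster, so it starts at 1.
// Extra == 0 models the old argument; Extra == 1 models the corrected one.
inline std::vector<unsigned> numLoadsSeen(unsigned NumOps, unsigned Extra) {
  assert(NumOps >= 1 && "need at least one memory operation");
  RecordingHook Hook;
  unsigned ClusterLength = 1;
  for (unsigned Idx = 0; Idx + 1 < NumOps; ++Idx) {
    if (Hook.shouldCluster(ClusterLength + Extra))
      ++ClusterLength; // the pair joined the cluster
    else
      ClusterLength = 1; // start a new cluster
  }
  return Hook.Seen;
}
```

With two operations, `numLoadsSeen(2, 0)` yields a single query with value 1, while `numLoadsSeen(2, 1)` yields a single query with value 2, which is the number of loads the cluster would actually contain.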

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

rampitec created this revision. Jan 23 2020, 1:27 PM

Herald added a project: Restricted Project. Jan 23 2020, 1:27 PM

Herald added subscribers: hiraditya, t-tye, tpr and 6 others.

I tried something similar in D72325.

In D73292#1837215, @foad wrote:

I tried something similar in D72325.

Comments there argue about how much we should cluster, but regardless I do not think we should use wrong data. If we want more clustering we need to increase thresholds, but still rely on a correct input.

I also have an early version of a patch which can reschedule without clustering if no optimal schedule was found for a region.

We may postpone this change, use that patch and then finally correct this number while increasing the clustering threshold at the same time. It may be easier in terms of test updates and less prone to performance regressions.

In D73292#1837231, @rampitec wrote:

Comments there argue about how much we should cluster, but regardless I do not think we should use wrong data. If we want more clustering we need to increase thresholds, but still rely on a correct input.

I agree. I also think we should fix this properly in MachineScheduler:

--- a/llvm/lib/CodeGen/MachineScheduler.cpp
+++ b/llvm/lib/CodeGen/MachineScheduler.cpp
@@ -1584,7 +1584,7 @@ void BaseMemOpClusterMutation::clusterNeighboringMemOps(
     SUnit *SUb = MemOpRecords[Idx+1].SU;
     if (TII->shouldClusterMemOps(MemOpRecords[Idx].BaseOps,
                                  MemOpRecords[Idx + 1].BaseOps,
-                                 ClusterLength)) {
+                                 ClusterLength + 1)) {
       if (SUa->NodeNum > SUb->NodeNum)
         std::swap(SUa, SUb);
       if (DAG->addEdge(SUb, SDep(SUa, SDep::Cluster))) {

... and adjust any other target implementations of shouldClusterMemOps accordingly.

In D73292#1838140, @foad wrote:

I agree. I also think we should fix this properly in MachineScheduler:
--- a/llvm/lib/CodeGen/MachineScheduler.cpp
+++ b/llvm/lib/CodeGen/MachineScheduler.cpp
@@ -1584,7 +1584,7 @@ void BaseMemOpClusterMutation::clusterNeighboringMemOps(
     SUnit *SUb = MemOpRecords[Idx+1].SU;
     if (TII->shouldClusterMemOps(MemOpRecords[Idx].BaseOps,
                                  MemOpRecords[Idx + 1].BaseOps,
-                                 ClusterLength)) {
+                                 ClusterLength + 1)) {
       if (SUa->NodeNum > SUb->NodeNum)
         std::swap(SUa, SUb);
       if (DAG->addEdge(SUb, SDep(SUa, SDep::Cluster))) {
... and adjust any other target implementations of shouldClusterMemOps accordingly.

I agree with that too. I just cannot deal with any single performance regression in all of the targets, including out of tree targets. This bug has been here for at least 6 years, since I first noticed it. I believe every in-tree target needs to do the same as in this patch, and then we can collectively switch away from this bug. It sounds like the only discussion is about the process of getting there.

I just cannot deal with any single performance regression in all of the targets, including out of tree targets.

We can adjust AArch64 (the only other affected in-tree target) so there's no change in behaviour, and out-of-tree target maintainers can adjust their own targets.
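
A sketch of why adjusting the AArch64 check keeps behaviour unchanged: once the scheduler passes the real cluster length, the "single pair" limit has to move from 1 to 2. The helper names below (`rejectsOld`, `rejectsNew`, `pairCheckUnchanged`) are hypothetical and only model the comparison, not the rest of the hook.

```cpp
#include <cassert>

// Hypothetical stand-ins for the "only cluster up to a single pair" check.
// Each returns true when the hook would refuse to cluster.
constexpr bool rejectsOld(unsigned NumLoadsArg) { return NumLoadsArg > 1; }
constexpr bool rejectsNew(unsigned NumLoadsArg) { return NumLoadsArg > 2; }

// ActualLength is the real number of operations in the prospective cluster.
// The old scheduler passed ActualLength - 1; the corrected one passes
// ActualLength itself. The decision must be the same in both cases.
inline bool pairCheckUnchanged(unsigned MaxLength) {
  assert(MaxLength >= 2 && "a cluster has at least two operations");
  for (unsigned ActualLength = 2; ActualLength <= MaxLength; ++ActualLength)
    if (rejectsOld(ActualLength - 1) != rejectsNew(ActualLength))
      return false;
  return true;
}

static_assert(!rejectsNew(2), "a pair is still allowed");
static_assert(rejectsNew(3), "a third operation is still rejected");
```

Out-of-tree targets could apply the same kind of check to their own limits when adapting to the corrected argument.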

LGTM.

This revision is now accepted and ready to land. Jan 24 2020, 6:26 AM

Changed the patch to only correct clusterNeighboringMemOps. Threshold logic adjusted accordingly.
We will need to retune our threshold in AMDGPU separately, likely after the scheduler gains the ability to break clustering under high pressure.
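
The "threshold logic adjusted accordingly" part can be checked with a small model of the two AMDGPU conditions touched by the patch: the load-count limit (6 before, 7 after) and the byte-size comparison (`NumLoads` before, `NumLoads - 1` after). This is a sketch under assumptions: the function names are hypothetical, and `RegBytes` and `ThresholdBytes` are free parameters rather than the values used in SIInstrInfo.cpp.

```cpp
#include <cassert>

// Load-count limit: the old code rejected NumLoads > 6, the new one > 7.
constexpr bool countOkOld(unsigned PassedNumLoads) { return PassedNumLoads <= 6; }
constexpr bool countOkNew(unsigned NumLoads) { return NumLoads <= 7; }

// Byte-size check. Before the patch the argument was already one less than
// the cluster length; after the patch the hook subtracts 1 itself.
constexpr bool fitsOld(unsigned PassedNumLoads, unsigned RegBytes,
                       unsigned ThresholdBytes) {
  return PassedNumLoads * RegBytes <= ThresholdBytes;
}
constexpr bool fitsNew(unsigned NumLoads, unsigned RegBytes,
                       unsigned ThresholdBytes) {
  return (NumLoads - 1) * RegBytes <= ThresholdBytes;
}

// Returns true if both checks give the same answer before and after the
// patch for every real cluster length from 2 up to MaxLength.
inline bool amdgpuChecksUnchanged(unsigned MaxLength, unsigned RegBytes,
                                  unsigned ThresholdBytes) {
  assert(MaxLength >= 2 && "a cluster has at least two operations");
  for (unsigned Length = 2; Length <= MaxLength; ++Length) {
    const unsigned OldArg = Length - 1; // what the scheduler used to pass
    if (countOkOld(OldArg) != countOkNew(Length))
      return false;
    if (fitsOld(OldArg, RegBytes, ThresholdBytes) !=
        fitsNew(Length, RegBytes, ThresholdBytes))
      return false;
  }
  return true;
}
```

Because the subtraction merely moves from the caller into the hook, the model agrees for any register size and threshold, which matches the NFC claim; retuning the threshold itself is a separate change.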

Herald added subscribers: javed.absar, MatzeB. Jan 24 2020, 12:07 PM

rampitec requested review of this revision. Jan 24 2020, 12:07 PM

LGTM! Perhaps add a comment in TargetInstrInfo.h documenting the argument? E.g. "the number of loads that will be in the cluster if this hook returns true"...?

This revision is now accepted and ready to land. Jan 24 2020, 12:21 PM

Added operand description to TargetInstrInfo::shouldClusterMemOps()

Closed by commit rGbe8e38cbd978: Correct NumLoads in clustering (authored by rampitec). Jan 24 2020, 12:46 PM

This revision was automatically updated to reflect the committed changes.

foad mentioned this in D72325: [AMDGPU] Fix cluster size threshold calculation. Mar 24 2020, 3:40 AM

Revision Contents

Path                                            Size
llvm/include/llvm/CodeGen/TargetInstrInfo.h     4 lines
llvm/lib/CodeGen/MachineScheduler.cpp           2 lines
llvm/lib/Target/AArch64/AArch64InstrInfo.cpp    2 lines
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp          8 lines

Diff 240274

llvm/include/llvm/CodeGen/TargetInstrInfo.h

@@ ... @@ public:
   }

   /// Returns true if the two given memory operations should be scheduled
   /// adjacent. Note that you have to add:
   /// DAG->addMutation(createLoadClusterDAGMutation(DAG->TII, DAG->TRI));
   /// or
   /// DAG->addMutation(createStoreClusterDAGMutation(DAG->TII, DAG->TRI));
   /// to TargetPassConfig::createMachineScheduler() to have an effect.
+  ///
+  /// \p BaseOps1 and \p BaseOps2 are memory operands of two memory operations.
+  /// \p NumLoads is the number of loads that will be in the cluster if this
+  /// hook returns true.
   virtual bool shouldClusterMemOps(ArrayRef<const MachineOperand *> BaseOps1,
                                    ArrayRef<const MachineOperand *> BaseOps2,
                                    unsigned NumLoads) const {
     llvm_unreachable("target did not implement shouldClusterMemOps()");
   }

   /// Reverses the branch condition of the specified condition list,
   /// returning false on success and true if it cannot be reversed.

llvm/lib/CodeGen/MachineScheduler.cpp

@@ ... @@ #endif

   llvm::sort(MemOpRecords);
   unsigned ClusterLength = 1;
   for (unsigned Idx = 0, End = MemOpRecords.size(); Idx < (End - 1); ++Idx) {
     SUnit *SUa = MemOpRecords[Idx].SU;
     SUnit *SUb = MemOpRecords[Idx+1].SU;
     if (TII->shouldClusterMemOps(MemOpRecords[Idx].BaseOps,
                                  MemOpRecords[Idx + 1].BaseOps,
-                                 ClusterLength)) {
+                                 ClusterLength + 1)) {
       if (SUa->NodeNum > SUb->NodeNum)
         std::swap(SUa, SUb);
       if (DAG->addEdge(SUb, SDep(SUa, SDep::Cluster))) {
         LLVM_DEBUG(dbgs() << "Cluster ld/st SU(" << SUa->NodeNum << ") - SU("
                           << SUb->NodeNum << ")\n");
         // Copy successor edges from SUa to SUb. Interleaving computation
         // dependent on SUa can prevent load combining due to register reuse.
         // Predecessor edges do not need to be copied from SUb to SUa since

llvm/lib/Target/AArch64/AArch64InstrInfo.cpp

@@ ... @@ bool AArch64InstrInfo::shouldClusterMemOps(
   assert((BaseOp1.isReg() || BaseOp1.isFI()) &&
          "Only base registers and frame indices are supported.");

   // Check for both base regs and base FI.
   if (BaseOp1.isReg() && BaseOp1.getReg() != BaseOp2.getReg())
     return false;

   // Only cluster up to a single pair.
-  if (NumLoads > 1)
+  if (NumLoads > 2)
     return false;

   if (!isPairableLdStInst(FirstLdSt) || !isPairableLdStInst(SecondLdSt))
     return false;

   // Can we pair these instructions based on their opcodes?
   unsigned FirstOpc = FirstLdSt.getOpcode();
   unsigned SecondOpc = SecondLdSt.getOpcode();

llvm/lib/Target/AMDGPU/SIInstrInfo.cpp

@@ ... @@
   if (!memOpsHaveSameBasePtr(FirstLdSt, BaseOp1, SecondLdSt, BaseOp2))
     return false;

   const MachineOperand *FirstDst = nullptr;
   const MachineOperand *SecondDst = nullptr;

   if ((isMUBUF(FirstLdSt) && isMUBUF(SecondLdSt)) ||
       (isMTBUF(FirstLdSt) && isMTBUF(SecondLdSt)) ||
       (isFLAT(FirstLdSt) && isFLAT(SecondLdSt))) {
-    const unsigned MaxGlobalLoadCluster = 6;
+    const unsigned MaxGlobalLoadCluster = 7;
     if (NumLoads > MaxGlobalLoadCluster)
       return false;

     FirstDst = getNamedOperand(FirstLdSt, AMDGPU::OpName::vdata);
     if (!FirstDst)
       FirstDst = getNamedOperand(FirstLdSt, AMDGPU::OpName::vdst);
     SecondDst = getNamedOperand(SecondLdSt, AMDGPU::OpName::vdata);
     if (!SecondDst)
@@ ... @@
   const MachineRegisterInfo &MRI =
       FirstLdSt.getParent()->getParent()->getRegInfo();

   const Register Reg = FirstDst->getReg();

   const TargetRegisterClass *DstRC = Register::isVirtualRegister(Reg)
                                          ? MRI.getRegClass(Reg)
                                          : RI.getPhysRegClass(Reg);

-  return (NumLoads * (RI.getRegSizeInBits(*DstRC) / 8)) <= LoadClusterThreshold;
+  // FIXME: NumLoads should not be subtracted 1. This is to match behavior
+  // of clusterNeighboringMemOps which was previosly passing cluster length
+  // less 1. LoadClusterThreshold should be tuned instead.
+  return ((NumLoads - 1) * (RI.getRegSizeInBits(*DstRC) / 8)) <=
+         LoadClusterThreshold;
 }

 // FIXME: This behaves strangely. If, for example, you have 32 load + stores,
 // the first 16 loads will be interleaved with the stores, and the next 16 will
 // be clustered as expected. It should really split into 2 16 store batches.
 //
 // Loads are clustered until this returns false, rather than trying to schedule
 // groups of stores. This also means we have to deal with saying different