This is an archive of the discontinued LLVM Phabricator instance.

Differential D80119

[AMDGPU/MemOpsCluster] Code clean-up around mem ops clustering logic
ClosedPublic

Authored by hsmhsm on May 18 2020, 5:25 AM.

Details

Reviewers

foad
rampitec
arsenm
vpykhtin
javedabsar

Commits

rG09f7dcb64e1b: [AMDGPU/MemOpsCluster] Code clean-up around mem ops clustering logic

Summary

Clean-up code around mem ops clustering logic. This patch cleans up code within
the function clusterNeighboringMemOps(). It is WIP, and this patch is a first cut.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

hsmhsm created this revision. (May 18 2020, 5:25 AM)

Herald added a project: Restricted Project. (May 18 2020, 5:25 AM)

Herald added subscribers: llvm-commits, kerbowa, javed.absar and 10 others.

This causes lots of test failures:

Failing Tests (35):
  LLVM :: CodeGen/AMDGPU/GlobalISel/cvt_f32_ubyte.ll
  LLVM :: CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.div.fmas.ll
  LLVM :: CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.div.scale.ll
  LLVM :: CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.ubfe.ll
  LLVM :: CodeGen/AMDGPU/amdgpu-codegenprepare-idiv.ll
  LLVM :: CodeGen/AMDGPU/amdhsa-trap-num-sgprs.ll
  LLVM :: CodeGen/AMDGPU/bitreverse.ll
  LLVM :: CodeGen/AMDGPU/call-argument-types.ll
  LLVM :: CodeGen/AMDGPU/cvt_f32_ubyte.ll
  LLVM :: CodeGen/AMDGPU/fshr.ll
  LLVM :: CodeGen/AMDGPU/idot2.ll
  LLVM :: CodeGen/AMDGPU/idot4s.ll
  LLVM :: CodeGen/AMDGPU/idot4u.ll
  LLVM :: CodeGen/AMDGPU/idot8s.ll
  LLVM :: CodeGen/AMDGPU/idot8u.ll
  LLVM :: CodeGen/AMDGPU/insert_vector_dynelt.ll
  LLVM :: CodeGen/AMDGPU/insert_vector_elt.v2i16.ll
  LLVM :: CodeGen/AMDGPU/llvm.amdgcn.raw.buffer.load.ll
  LLVM :: CodeGen/AMDGPU/llvm.maxnum.f16.ll
  LLVM :: CodeGen/AMDGPU/llvm.minnum.f16.ll
  LLVM :: CodeGen/AMDGPU/llvm.round.f64.ll
  LLVM :: CodeGen/AMDGPU/memory_clause.ll
  LLVM :: CodeGen/AMDGPU/merge-stores.ll
  LLVM :: CodeGen/AMDGPU/promote-constOffset-to-imm.ll
  LLVM :: CodeGen/AMDGPU/salu-to-valu.ll
  LLVM :: CodeGen/AMDGPU/sdiv.ll
  LLVM :: CodeGen/AMDGPU/sdiv64.ll
  LLVM :: CodeGen/AMDGPU/setcc-limit-load-shrink.ll
  LLVM :: CodeGen/AMDGPU/sgpr-control-flow.ll
  LLVM :: CodeGen/AMDGPU/smrd.ll
  LLVM :: CodeGen/AMDGPU/srem64.ll
  LLVM :: CodeGen/AMDGPU/trunc-combine.ll
  LLVM :: CodeGen/AMDGPU/udiv64.ll
  LLVM :: CodeGen/AMDGPU/urem64.ll
  LLVM :: CodeGen/AMDGPU/use-sgpr-multiple-times.ll


Testing Time: 358.21s
  Unsupported Tests  :   102
  Expected Passes    : 15421
  Expected Failures  :    49
  Unexpected Failures:    35

Harbormaster failed remote builds in B57057: Diff 264595! (May 18 2020, 6:57 AM)

(1) The NumLoads argument can initially be set to 2 instead of 1, which avoids incrementing NumLoads when calling shouldClusterMemOps()

Alternatively you could move ++ClusterLength to just before the call to shouldClusterMemOps. But I really don't think this is very important.
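
To make the equivalence behind point (1) concrete, here is a minimal standalone sketch, not the LLVM code itself. It assumes a hypothetical predicate that stands in for TII->shouldClusterMemOps() and receives only the candidate cluster size, and it assumes DAG->addEdge() always succeeds. The names ShouldCluster, sizesSeenOld and sizesSeenNew are made up for illustration. The point is only that both loop shapes hand the same sequence of sizes to the predicate.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical stand-in for TII->shouldClusterMemOps(): it sees only the
// candidate cluster size (the real hook also receives the base operands).
using ShouldCluster = std::function<bool(unsigned)>;

// Original loop shape: pass ClusterLength + 1, increment only on success.
std::vector<unsigned> sizesSeenOld(std::size_t NumMemOps,
                                   const ShouldCluster &Pred) {
  assert(Pred && "predicate must be callable");
  std::vector<unsigned> Seen;
  unsigned ClusterLength = 1;
  for (std::size_t Idx = 0; Idx + 1 < NumMemOps; ++Idx) {
    Seen.push_back(ClusterLength + 1);
    if (Pred(ClusterLength + 1))
      ++ClusterLength;
    else
      ClusterLength = 1;
  }
  return Seen;
}

// Reworked loop shape: increment first, then reset and continue on failure.
std::vector<unsigned> sizesSeenNew(std::size_t NumMemOps,
                                   const ShouldCluster &Pred) {
  assert(Pred && "predicate must be callable");
  std::vector<unsigned> Seen;
  unsigned ClusterLength = 1;
  for (std::size_t Idx = 0; Idx + 1 < NumMemOps; ++Idx) {
    ++ClusterLength;
    Seen.push_back(ClusterLength);
    if (!Pred(ClusterLength)) {
      ClusterLength = 1;
      continue;
    }
    // The cluster edge would be added here in the real code.
  }
  return Seen;
}
```

With six mem ops and a predicate that accepts clusters of up to three ops, both functions record the sizes 2, 3, 4, 2, 3.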

(2) Re-arrange the code within clusterNeighboringMemOps(), so that the code is easier to understand

I agree that using continue is more readable.

(3) Improve the logic within shouldClusterMemOps() and remove all FIXMEs

You've changed it to accurately count bytes in a cluster, which is nice, but (a) it's a change in behaviour so needs test updates and benchmarking, and (b) I don't particularly like the implementation. I think a cleaner way to do this would be to change all target getMemOperandsWithOffset functions to return a Width (in bytes) as well (some targets already have an internal getMemOperandsWithOffsetWidth function which does this). MachineScheduler could then use this information to track the size of a cluster and pass it into shouldClusterMemOps.
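
As a rough illustration of the suggestion above, the sketch below tracks both the number of ops and the number of bytes in the current cluster and passes both to the clustering decision. It is only a sketch under stated assumptions and is not LLVM API: MemOpSketch, shouldClusterSketch, clusterSizes and the 4-op / 16-byte limits are all hypothetical, and the real hook also takes base operands.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical record. The real MemOpInfo holds an SUnit and base operands;
// Width (in bytes) is the extra piece of information the suggestion adds.
struct MemOpSketch {
  long long Offset;
  unsigned Width;
};

// Hypothetical target policy: at most 4 ops and at most 16 bytes per cluster.
bool shouldClusterSketch(unsigned NumOps, unsigned NumBytes) {
  return NumOps <= 4 && NumBytes <= 16;
}

// Returns the number of mem ops in each run of clustered neighbours.
// Ops is assumed to be sorted by offset already.
std::vector<unsigned> clusterSizes(const std::vector<MemOpSketch> &Ops) {
  std::vector<unsigned> Sizes;
  if (Ops.empty())
    return Sizes;
  unsigned ClusterLength = 1;
  unsigned ClusterBytes = Ops[0].Width;
  for (std::size_t Idx = 0; Idx + 1 < Ops.size(); ++Idx) {
    assert(Ops[Idx].Offset <= Ops[Idx + 1].Offset && "Ops must be sorted");
    unsigned NextWidth = Ops[Idx + 1].Width;
    if (!shouldClusterSketch(ClusterLength + 1, ClusterBytes + NextWidth)) {
      // The pair cannot be clustered: close the current run and start a new
      // one at the second op of the pair.
      Sizes.push_back(ClusterLength);
      ClusterLength = 1;
      ClusterBytes = NextWidth;
      continue;
    }
    ++ClusterLength;
    ClusterBytes += NextWidth;
  }
  Sizes.push_back(ClusterLength);
  return Sizes;
}
```

Under these made-up limits, five adjacent 4-byte ops give runs of 4 and 1, while three adjacent 8-byte ops give runs of 2 and 1 because the byte limit is hit first.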

Since it is safer to do the clean-up step by step, the previous, somewhat larger
patch has been reverted, and this modified patch takes an initial step
towards the clean-up.

Hi Jay,

I would rather not submit a bigger patch since it is a bit risky. So, let's take it one step at a time. This modified patch is the first cut.

Harbormaster completed remote builds in B57670: Diff 265782. (May 22 2020, 2:30 PM)

Looks OK to me. Please also update the "summary" section at the top of this review (unfortunately "arc diff" does not do this for you when you edit the git commit message, so you have to do it by hand).

This revision is now accepted and ready to land. (May 26 2020, 1:22 AM)

hsmhsm edited the summary of this revision. (May 26 2020, 3:13 AM)

Closed by commit rG09f7dcb64e1b: [AMDGPU/MemOpsCluster] Code clean-up around mem ops clustering logic (authored by hsmhsm). (May 26 2020, 3:45 AM)

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path: llvm/lib/CodeGen/MachineScheduler.cpp
Size: 64 lines

Diff 266154

llvm/lib/CodeGen/MachineScheduler.cpp

[1,574 lines above not shown]

 #ifndef NDEBUG
     for (auto *Op : BaseOps)
       assert(Op);
 #endif
   }
   if (MemOpRecords.size() < 2)
     return;

   llvm::sort(MemOpRecords);

+  // At this point, `MemOpRecords` array must hold atleast two mem ops. Try to
+  // cluster mem ops collected within `MemOpRecords` array.
   unsigned ClusterLength = 1;
   for (unsigned Idx = 0, End = MemOpRecords.size(); Idx < (End - 1); ++Idx) {
-    SUnit *SUa = MemOpRecords[Idx].SU;
-    SUnit *SUb = MemOpRecords[Idx+1].SU;
-    if (TII->shouldClusterMemOps(MemOpRecords[Idx].BaseOps,
-                                 MemOpRecords[Idx + 1].BaseOps,
-                                 ClusterLength + 1)) {
-      if (SUa->NodeNum > SUb->NodeNum)
-        std::swap(SUa, SUb);
-      if (DAG->addEdge(SUb, SDep(SUa, SDep::Cluster))) {
-        LLVM_DEBUG(dbgs() << "Cluster ld/st SU(" << SUa->NodeNum << ") - SU("
-                          << SUb->NodeNum << ")\n");
-        // Copy successor edges from SUa to SUb. Interleaving computation
-        // dependent on SUa can prevent load combining due to register reuse.
-        // Predecessor edges do not need to be copied from SUb to SUa since
-        // nearby loads should have effectively the same inputs.
-        for (const SDep &Succ : SUa->Succs) {
-          if (Succ.getSUnit() == SUb)
-            continue;
-          LLVM_DEBUG(dbgs()
-                     << " Copy Succ SU(" << Succ.getSUnit()->NodeNum << ")\n");
-          DAG->addEdge(Succ.getSUnit(), SDep(SUb, SDep::Artificial));
-        }
-        ++ClusterLength;
-      } else
-        ClusterLength = 1;
-    } else
-      ClusterLength = 1;
+    // Decision to cluster mem ops is taken based on target dependent logic
+    auto MemOpa = MemOpRecords[Idx];
+    auto MemOpb = MemOpRecords[Idx + 1];
+    ++ClusterLength;
+    if (!TII->shouldClusterMemOps(MemOpa.BaseOps, MemOpb.BaseOps,
+                                  ClusterLength)) {
+      // Current mem ops pair could not be clustered, reset cluster length, and
+      // go to next pair
+      ClusterLength = 1;
+      continue;
+    }
+
+    SUnit *SUa = MemOpa.SU;
+    SUnit *SUb = MemOpb.SU;
+    if (SUa->NodeNum > SUb->NodeNum)
+      std::swap(SUa, SUb);
+
+    // FIXME: Is this check really required?
+    if (!DAG->addEdge(SUb, SDep(SUa, SDep::Cluster))) {
+      ClusterLength = 1;
+      continue;
+    }
+
+    LLVM_DEBUG(dbgs() << "Cluster ld/st SU(" << SUa->NodeNum << ") - SU("
+                      << SUb->NodeNum << ")\n");
+
+    // Copy successor edges from SUa to SUb. Interleaving computation
+    // dependent on SUa can prevent load combining due to register reuse.
+    // Predecessor edges do not need to be copied from SUb to SUa since
+    // nearby loads should have effectively the same inputs.
+    for (const SDep &Succ : SUa->Succs) {
+      if (Succ.getSUnit() == SUb)
+        continue;
+      LLVM_DEBUG(dbgs() << " Copy Succ SU(" << Succ.getSUnit()->NodeNum
+                        << ")\n");
+      DAG->addEdge(Succ.getSUnit(), SDep(SUb, SDep::Artificial));
+    }
   }
 }

 /// Callback from DAG postProcessing to create cluster edges for loads.
 void BaseMemOpClusterMutation::apply(ScheduleDAGInstrs *DAG) {
   // Map DAG NodeNum to a set of dependent MemOps in store chain.
   DenseMap<unsigned, SmallVector<SUnit *, 4>> StoreChains;
   for (SUnit &SU : DAG->SUnits) {

[2,158 lines below not shown]