This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Skip unclusterd rescheduling w/o ld/st
ClosedPublic

Authored by rampitec on Feb 23 2021, 3:37 PM.

Details

Reviewers

arsenm
kerbowa
vpykhtin
alex-t

Commits

rG635993f07bd6: [AMDGPU] Skip unclusterd rescheduling w/o ld/st

Summary

We are attempting rescheduling without load store clustering
if occupancy limits were not met with clustering. Skip this
for regions which do not have any loads or stores at all.

In a set of kernels I am experimenting with this improves
scheduling time by ~30%.
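
The decision described above can be sketched in isolation. This is a minimal sketch that follows the logic of the first diff on this page (Diff 325917), not the landed commit, which per the later update checks for actual clusters instead. The names SchedStage, Instr, regionHasLdSt and updateRescheduleFlag are hypothetical stand-ins, and the stage order (InitialSchedule immediately before UnclusteredReschedule) is an assumption inferred from the `(Stage + 1)` test in the diff.

```cpp
#include <vector>

// Hypothetical stand-ins for illustration only; these are not the LLVM types.
// Assumption: InitialSchedule directly precedes UnclusteredReschedule.
enum SchedStage : unsigned { Collect, InitialSchedule, UnclusteredReschedule };

struct Instr {
  bool MayLoadOrStore; // stands in for MachineInstr::mayLoadOrStore()
};

// Folded into the loop that already walks the region: one OR per instruction.
bool regionHasLdSt(const std::vector<Instr> &Region) {
  bool SeenLdSt = false;
  for (const Instr &I : Region)
    SeenLdSt |= I.MayLoadOrStore;
  return SeenLdSt;
}

// Returns the new value of the region's "reschedule" flag.
//  - Accepted schedule: clear the flag when the next stage is the unclustered
//    one and the region has no loads or stores; otherwise leave it unchanged.
//  - Reverted schedule: request rescheduling unless the next stage is the
//    unclustered one and there is nothing to uncluster.
bool updateRescheduleFlag(bool Current, bool SeenLdSt, unsigned Stage,
                          bool Accepted) {
  const bool NextIsUnclustered = (Stage + 1) == UnclusteredReschedule;
  if (Accepted)
    return (!SeenLdSt && NextIsUnclustered) ? false : Current;
  return SeenLdSt || !NextIsUnclustered;
}
```

With these helpers, a region that contains only ALU-style instructions gets its flag cleared after the initial stage whether or not the schedule was accepted, so the unclustered stage has nothing to do for it.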

Diff Detail

Unit Tests: Failed

	Time	Test
	180 ms	x64 debian > Clang.Driver::print-libgcc-file-name-clangrt.c
	1,050 ms	x64 windows > Clang.Driver::print-libgcc-file-name-clangrt.c
	80 ms	x64 windows > LLVM.CodeGen/AArch64/GlobalISel::postlegalizer-lowering-shuf-to-ins.mir

Event Timeline

rampitec created this revision. · Feb 23 2021, 3:37 PM

Herald added subscribers: hiraditya, t-tye, tpr and 5 others. · Feb 23 2021, 3:37 PM

rampitec requested review of this revision. · Feb 23 2021, 3:37 PM

Herald added a project: Restricted Project. · Feb 23 2021, 3:37 PM

Herald added a subscriber: wdng.

Harbormaster completed remote builds in B90495: Diff 325917. · Feb 23 2021, 6:43 PM

Check for actual clusters in the region. This is a more precise method and adds a bit more speed.

arsenm added inline comments. · Feb 25 2021, 2:49 PM

llvm/lib/Target/AMDGPU/GCNSchedStrategy.h
53 ↗	(On Diff #326473)	Needs comment

Added comment.

Harbormaster completed remote builds in B90880: Diff 326473. · Feb 25 2021, 3:25 PM

rampitec added a child revision: D97506: [AMDGPU] Avoid second rescheduling for some regions. · Feb 25 2021, 3:53 PM

Harbormaster completed remote builds in B90913: Diff 326524. · Feb 25 2021, 6:36 PM

Looks good, but should we use just a single dedicated pass over SUs to check if there're clustered ops after first scheduling to make the logic slightly easier?

Replying to @vpykhtin (D97342#2590002):

My problem with that is that it has to be done in the schedule() method or somewhere else within GCNScheduleDAGMILive. The only way to get an SUnit there is to call getSUnit() with a MachineInstr, and that is a map lookup. That is, it is simply slower, and I am trying to squeeze out as much speed as I can.

Replying to @rampitec (D97342#2590737):

Actually it is not even possible. The place where I can do it does not have mutations applied yet.

Replying to @rampitec (D97342#2590876):

In fact, I did the experiment. It may look better, but I got 3.6% slower scheduling with that separate loop.

LGTM

This revision is now accepted and ready to land. · Feb 26 2021, 12:06 PM

vpykhtin mentioned this in D97506: [AMDGPU] Avoid second rescheduling for some regions. · Feb 26 2021, 12:08 PM

This revision was landed with ongoing or failed builds. · Feb 26 2021, 12:29 PM

Closed by commit rG635993f07bd6: [AMDGPU] Skip unclusterd rescheduling w/o ld/st (authored by rampitec).

This revision was automatically updated to reflect the committed changes.

rampitec added a commit: rG635993f07bd6: [AMDGPU] Skip unclusterd rescheduling w/o ld/st.

Revision Contents

Path	Size
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp	7 lines

Diff 325917

llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp

[... 298 lines collapsed ...] void GCNScheduleDAGMILive::schedule() {
   if (Stage == Collect) {
     // Just record regions at the first pass.
     Regions.push_back(std::make_pair(RegionBegin, RegionEnd));
     return;
   }

   std::vector<MachineInstr*> Unsched;
   Unsched.reserve(NumRegionInstrs);
+  bool SeenLdSt = false;
   for (auto &I : *this) {
+    SeenLdSt |= I.mayLoadOrStore();
     Unsched.push_back(&I);
   }

   GCNRegPressure PressureBefore;
   if (LIS) {
     PressureBefore = Pressure[RegionIdx];

     LLVM_DEBUG(dbgs() << "Pressure before scheduling:\nRegion live-ins:";
[... 58 lines collapsed ...] void GCNScheduleDAGMILive::schedule() {
   if (WavesAfter >= MinOccupancy) {
     if (Stage == UnclusteredReschedule &&
         !PressureAfter.less(ST, PressureBefore)) {
       LLVM_DEBUG(dbgs() << "Unclustered reschedule did not help.\n");
     } else if (WavesAfter > MFI.getMinWavesPerEU() ||
                PressureAfter.less(ST, PressureBefore) ||
                !RescheduleRegions[RegionIdx]) {
       Pressure[RegionIdx] = PressureAfter;
+      if (!SeenLdSt && (Stage + 1) == UnclusteredReschedule)
+        RescheduleRegions[RegionIdx] = false;
       return;
     } else {
       LLVM_DEBUG(dbgs() << "New pressure will result in more spilling.\n");
     }
   }

   LLVM_DEBUG(dbgs() << "Attempting to revert scheduling.\n");
-  RescheduleRegions[RegionIdx] = true;
+  RescheduleRegions[RegionIdx] = SeenLdSt ||
+                                 (Stage + 1) != UnclusteredReschedule;
   RegionEnd = RegionBegin;
   for (MachineInstr *MI : Unsched) {
     if (MI->isDebugInstr())
       continue;

     if (MI->getIterator() != RegionEnd) {
       BB->remove(MI);
       BB->insert(RegionEnd, MI);
[... 213 lines collapsed ...]

Lint: Pre-merge checks: clang-format: please reformat the code (on the changed RescheduleRegions assignment):
-  RescheduleRegions[RegionIdx] = SeenLdSt ||
-                                 (Stage + 1) != UnclusteredReschedule;
+  RescheduleRegions[RegionIdx] =
+      SeenLdSt || (Stage + 1) != UnclusteredReschedule;