This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Skip unclusterd rescheduling w/o ld/st
ClosedPublic

Authored by rampitec on Feb 23 2021, 3:37 PM.

Details

Summary

We are attempting rescheduling without load store clustering
if occupancy limits were not met with clustering. Skip this
for regions which do not have any loads or stores at all.

In a set of kernels I am experimenting with this improves
scheduling time by ~30%.

Diff Detail

Event Timeline

rampitec created this revision.Feb 23 2021, 3:37 PM
rampitec requested review of this revision.Feb 23 2021, 3:37 PM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 23 2021, 3:37 PM
Herald added a subscriber: wdng. · View Herald Transcript
rampitec updated this revision to Diff 326473.Feb 25 2021, 1:22 PM

Check for actual clusters in the region. This is a more precise method adding a bit more of speed.

arsenm added inline comments.Feb 25 2021, 2:49 PM
llvm/lib/Target/AMDGPU/GCNSchedStrategy.h
53

Needs comment

rampitec updated this revision to Diff 326524.Feb 25 2021, 3:08 PM
rampitec marked an inline comment as done.

Added comment.

Looks good, but should we use just a single dedicated pass over SUs to check if there're clustered ops after first scheduling to make the logic slightly easier?

Looks good, but should we use just a single dedicated pass over SUs to check if there're clustered ops after first scheduling to make the logic slightly easier?

My problem with that it has to be done in the schedule() method or somewhere else within GCNScheduleDAGMILive. The only way to get an SUnit there is to call getSUnit() passing a MachineInstr and that is a map lookup. I.e. it is simply slower and I am trying to squeeze as much speed as I could.

Looks good, but should we use just a single dedicated pass over SUs to check if there're clustered ops after first scheduling to make the logic slightly easier?

My problem with that it has to be done in the schedule() method or somewhere else within GCNScheduleDAGMILive. The only way to get an SUnit there is to call getSUnit() passing a MachineInstr and that is a map lookup. I.e. it is simply slower and I am trying to squeeze as much speed as I could.

Actually it is not even possible. The place where I can do it does not have mutations applied yet.

Looks good, but should we use just a single dedicated pass over SUs to check if there're clustered ops after first scheduling to make the logic slightly easier?

My problem with that it has to be done in the schedule() method or somewhere else within GCNScheduleDAGMILive. The only way to get an SUnit there is to call getSUnit() passing a MachineInstr and that is a map lookup. I.e. it is simply slower and I am trying to squeeze as much speed as I could.

Actually it is not even possible. The place where I can do it does not have mutations applied yet.

In fact I did the experiment, it may look better but I've got 3.6% slower scheduling with that separate loop.

This revision is now accepted and ready to land.Feb 26 2021, 12:06 PM
This revision was landed with ongoing or failed builds.Feb 26 2021, 12:29 PM
This revision was automatically updated to reflect the committed changes.