This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Enable load clustering in the post-RA scheduler
ClosedPublic

Authored by foad on Oct 12 2021, 7:49 AM.

Details

Summary

This has a couple of benefits:

  1. It can sometimes fix clusters that got broken apart when the register allocator inserted a copy.
  2. Post-RA scheduling does not have to worry about increasing register pressure, which in some cases gives it more freedom to reorder instructions.

Testing on a collection of 10,000 graphics shaders compiled for gfx1010
showed:

  • The average length of each run of one or more load instructions increased by about 1%.
  • The number of runs of two or more load instructions increased by about 4%.

Diff Detail

Event Timeline

foad created this revision.Oct 12 2021, 7:49 AM
foad requested review of this revision.Oct 12 2021, 7:49 AM
Herald added a project: Restricted Project. · View Herald TranscriptOct 12 2021, 7:49 AM

Interesting to see that quite a few of the test changes are clustering the loads, but not ending up with a clause (correctly).
Nice result otherwise though.

Interesting to see that quite a few of the test changes are clustering the loads, but not ending up with a clause (correctly).
Nice result otherwise though.

Oh no - they're just pre-gfx10, so no s_clause.

Any data on compile time? Other than potential compile time/ runtime tradeoff LGTM

This revision is now accepted and ready to land.Oct 12 2021, 10:27 AM
foad added a comment.Oct 13 2021, 3:52 AM

Any data on compile time? Other than potential compile time/ runtime tradeoff LGTM

We see an average 0.11% slow down in internal compile time testing on a bunch of graphics pipelines. Does that seem acceptable?

Joe_Nash accepted this revision.Oct 13 2021, 8:25 AM

Any data on compile time? Other than potential compile time/ runtime tradeoff LGTM

We see an average 0.11% slow down in internal compile time testing on a bunch of graphics pipelines. Does that seem acceptable?

Seems acceptable to me.

This revision was automatically updated to reflect the committed changes.

Sounds good to me. The runtime improvement from clustering is notoriously difficult to assess, but your static data shows some potential benefit.

Having said that, it probably makes sense to guard the mutation by the optlevel check, so we only enable it with -O2 or higher.

OptLevel > CodeGenOpt::Less
foad added a comment.Oct 14 2021, 2:32 AM

Sounds good to me. The runtime improvement from clustering is notoriously difficult to assess, but your static data shows some potential benefit.

Having said that, it probably makes sense to guard the mutation by the optlevel check, so we only enable it with -O2 or higher.

Interesting. I wonder if we should do that for all the mutations, or even all the (post-ra?) scheduling?