This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Don't cluster stores
ClosedPublic

Authored by foad on Aug 7 2020, 8:31 AM.

Details

Summary

Clustering loads has caching benefits, but as far as I know there is no
advantage to clustering stores on any AMDGPU subtargets.

The disadvantage is that it tends to increase register pressure and
restricts scheduling freedom.

Diff Detail

Event Timeline

foad created this revision.Aug 7 2020, 8:31 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 7 2020, 8:31 AM
foad requested review of this revision.Aug 7 2020, 8:31 AM
foad added a comment.Aug 7 2020, 9:11 AM

Some statistics for this change, from statically compiling 9785 graphics shaders from games for GFX9:
average vgpr_count decreases by 0.04%
average sgpr_count decreases by 0.001%
total number of instructions decreases by 21 (0.0004%)
number of s_waitcnt instructions decreases by 17 (0.0003%)
So the numbers are tiny but at least they moved in the right direction!

I think we should have benefit of write combining. Do you have any performance numbers? That shall be deciding point.

critson added a subscriber: critson.Aug 8 2020, 6:25 PM

I tried this change with game traces on GFX10.

I could not convince myself that there was any statically significant changes in performance.
Some small gains, some small losses, nothing outside the range of variance.

I do however wonder if this would be better as a tuning option?
The same could also be said for load clustering as essentially you are trading VGPR pressure for VMEM access efficiency (e.g. stalls).

This revision is now accepted and ready to land.Aug 10 2020, 11:41 AM
This revision was automatically updated to reflect the committed changes.

I tried this change with game traces on GFX10.

I could not convince myself that there was any statically significant changes in performance.
Some small gains, some small losses, nothing outside the range of variance.

I do however wonder if this would be better as a tuning option?
The same could also be said for load clustering as essentially you are trading VGPR pressure for VMEM access efficiency (e.g. stalls).

Did you try this with xnack enabled? This will reduce the number of soft clauses formed for stores

foad added a comment.Feb 13 2021, 5:25 AM

Did you try this with xnack enabled?

Probably not. I don't think xnack is enabled for any of the platforms we usually care about for Vulkan graphics.

This will reduce the number of soft clauses formed for stores

Is that a problem, and if so why?