This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] SILoadStoreOptimizer: avoid unbounded register pressure increases
ClosedPublic

Authored by foad on Feb 4 2022, 8:31 AM.

Details

Summary

Previously when combining two loads this pass would sink the
first one down to the second one, putting the combined load
where the second one was. It would also sink any intervening
instructions which depended on the first load down to just
after the combined load.

For example, if we started with this sequence of
instructions (code flowing from left to right):

X A B C D E F Y

After combining loads X and Y into XY we might end up with:

A B C D E F XY

But if B D and F depended on X, we would get:

A C E XY B D F

Now if the original code had some short disjoint live ranges
from A to B, C to D and E to F, in the transformed code
these live ranges will be long and overlapping. In this way
a single merge of two loads could cause an unbounded
increase in register pressure.

To fix this, change the way the way that loads are moved in
order to merge them so that:

  • The second load is moved up to the first one. (But when merging stores, we still move the first store down to the second one.)
  • Intervening instructions are never moved.
  • Instead, if we find an intervening instruction that would need to be moved, give up on the merge. But this case should now be pretty rare because normal stores have no outputs, and normal loads only have address register inputs, but these will be identical for any pair of loads that we try to merge.

As well as fixing the unbounded register pressure increase
problem, moving loads up and stores down seems like it
should usually be a win for memory latency reasons.

Diff Detail

Event Timeline

foad created this revision.Feb 4 2022, 8:31 AM
foad requested review of this revision.Feb 4 2022, 8:31 AM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 4 2022, 8:31 AM

Doesn't moving a load higher and a store lower increase register pressure itself?

foad added a comment.Feb 7 2022, 3:16 AM

Doesn't moving a load higher and a store lower increase register pressure itself?

Yes, but in a very controlled way: exactly one live range gets lengthened. I'm trying to avoid the kind of crzay problems we've seen where merging one pair of loads increases the pressure by several hundred vgprs.

foad added a comment.Feb 7 2022, 9:01 AM

I statically compiled 10,000 graphics shaders to look at the effect of this patch on register pressure.
309 shaders had their overall vgpr count reduced, by an average of 2.9 vgprs.
310 shaders had their overal vgpr count increased, by an average of 2.5 vgprs.

piotr added a comment.Feb 7 2022, 9:19 AM

So that indicates an improvement in the average vgpr count.

Btw, does the patch result in a noticeably smaller number of merges on average?

arsenm accepted this revision.Feb 7 2022, 11:13 AM
arsenm added inline comments.
llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
604–605

Generally those don't exist at this point outside of copies

This revision is now accepted and ready to land.Feb 7 2022, 11:13 AM
foad added a comment.Feb 7 2022, 11:56 AM

So that indicates an improvement in the average vgpr count.

Yes.

Btw, does the patch result in a noticeably smaller number of merges on average?

No, to my surprise there was absolutely no difference in the amount of merging in any of the 10,000 shaders. I checked by diffing the instruction mix for each shader. The only differences were in VALU and SALU instructions.

llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
604–605

What don't? Subregs or partial physreg overlaps or both?

arsenm added inline comments.Feb 7 2022, 11:59 AM
llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
604–605

Both. In SSA those generally appear as plain copies to do the extract or copy from physreg

piotr added a comment.Feb 7 2022, 11:48 PM

So that indicates an improvement in the average vgpr count.

Yes.

Btw, does the patch result in a noticeably smaller number of merges on average?

No, to my surprise there was absolutely no difference in the amount of merging in any of the 10,000 shaders. I checked by diffing the instruction mix for each shader. The only differences were in VALU and SALU instructions.

Cool, that's even better - it means the improved vgpr count was not achieved at the expense of the actual merging.

foad updated this revision to Diff 409647.Feb 17 2022, 7:20 AM

Remove comments about subregs.

foad marked an inline comment as done.Feb 17 2022, 7:21 AM
foad added inline comments.
llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
604–605

OK, I just removed the comments.

foad marked an inline comment as done.Feb 17 2022, 7:22 AM

Performance testing on graphics frames showed no nasty surprises (and no big improvements either).

This revision was landed with ongoing or failed builds.Feb 21 2022, 2:51 AM
This revision was automatically updated to reflect the committed changes.