This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Break Large PHIs: Take whole PHI chains into account
ClosedPublic

Authored by Pierre-vh on Jul 27 2023, 3:15 AM.

Details

Summary

Previous heuristics had a big flaw: they only looked at single PHI at a time, and didn't take into account the whole "chain".
The concept of "chain" is important because if we only break a chain partially, we risk forcing regalloc to reserve twice as many registers for that vector.
We also risk adding a lot of copies that shouldn't be there and can inhibit backend optimizations.

The solution I found is to consider the whole "PHI chain" when looking at PHI.
That is, we recursively look at the PHI's incoming value & users for other PHIs, then make a decision about the chain as a whole.

The currrent threshold requires that at least ceil(chain size * (2/3)) PHIs have at least one interesting incoming value.
In simple terms, two-thirds (rounded up) of the PHIs should be breakable.

This seems to work well. A lower threshold such as 50% is too aggressive because chains can often have 7 or 9 PHIs, and breaking 3+ or 4+ PHIs in those case often causes performance issue.

Fixes SWDEV-409648, SWDEV-398393, SWDEV-413487

Diff Detail

Event Timeline

Pierre-vh created this revision.Jul 27 2023, 3:15 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 27 2023, 3:16 AM
Pierre-vh requested review of this revision.Jul 27 2023, 3:16 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 27 2023, 3:16 AM

Note that I checked this fixes the rocRAND threefry regression, which was the last remaining one.
The threshold was found experimentally by checking rocRAND performance results. IIRC, 50% was good for MI100, but caused regressions on MI200 in a few tests. 66% seems like a good middleground that still benefits rocRAND on MI100 without hurting MI200.
I also think that threshold makes sense. We want to break all of the PHIs only when there is a very strong indication that it'll be beneficial. 50% is too much of a coinflip.

arsenm added inline comments.Jul 28 2023, 7:51 AM
llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
1831

Try to avoid FP for this?

1836

don't understand the point of the metadata, adding metadata only for a single test seems like a bad idea

Pierre-vh marked 2 inline comments as done.

Remove MD, avoid FP math

arsenm added inline comments.Jul 31 2023, 4:43 PM
llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
34

Dead include

rm dead include

Pierre-vh marked an inline comment as done.Jul 31 2023, 11:39 PM

rm dead MD in the test too

arsenm added inline comments.Aug 3 2023, 6:58 AM
llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
1842

Can you use isInterestingPHIIncomingValue directly without the lambda?

Pierre-vh updated this revision to Diff 546850.Aug 3 2023, 7:10 AM

Address comment

arsenm added inline comments.Aug 3 2023, 7:22 AM
llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
444

Random formatting change

556–557

Random formatting changes

Pierre-vh updated this revision to Diff 546853.Aug 3 2023, 7:24 AM

Remove formatting changes

arsenm accepted this revision.Aug 3 2023, 7:31 AM

I've been debating just introduce a late large register tuple splitting pass

llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
1809

These could probably be SmallPtrSets

This revision is now accepted and ready to land.Aug 3 2023, 7:31 AM
Pierre-vh updated this revision to Diff 546865.Aug 3 2023, 7:40 AM
Pierre-vh marked an inline comment as done.

Switch to SmallPtrSet

Pierre-vh added a comment.EditedAug 3 2023, 7:40 AM

I've also been wondering if this transform should just be a separate pass that runs very late, right before ISel, to avoid any interference/unintended side effects such as the Structurizer adding branches if we split a PHI.
Then it could be generalized and made smarter to think about more than just PHIs.

Is that close to what you're thinking, or are you thinking about something else entirely?
I've also been wondering if it's possible to just do this in the DAG directly, when splitting values (make it possible to split a CopyFrom/CopyToReg), but that seems like a very big effort.

This revision was landed with ongoing or failed builds.Aug 3 2023, 7:41 AM
This revision was automatically updated to reflect the committed changes.