This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Fold more AGPR copies/PHIs in SIFoldOperands
ClosedPublic

Authored by Pierre-vh on Feb 15 2023, 5:22 AM.

Details

Summary

Generalize tryFoldLCSSAPhi into tryFoldPhiAGPR which works
on any kind of PHI node (not just LCSSA ones) and attempts to
create AGPR Phis more aggressively.

Also adds a GFX908-only "cleanup" function tryOptimizeAGPRPhis
which tries to minimize AGPR to AGPR copies on GFX908, which doesn't
have a ACCVGPR MOV instruction (so AGPR-AGPR copies become 2 or 3 instructions
as they need a VGPR temp). The reason why this is needed is because D143731
+ the new tryFoldPhiAGPR may create a lot more PHIs (one 32xfloat PHI becomes
32 float phis), and if each PHI hits the same AGPR (like in test_mfma_loop_agpr_init)
they will be lowered to 32 copies from the same AGPR, which will each become 2-3 instructions.
Creating a VGPR cache in this case prevents all those copies from being generated
(we have AGPR-VGPR copies instead which are trivial).

This is a prepation patch intended to prevent regressions in D143731 when
AGPRs are involved.

Diff Detail

Event Timeline

Pierre-vh created this revision.Feb 15 2023, 5:22 AM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 15 2023, 5:22 AM
Pierre-vh requested review of this revision.Feb 15 2023, 5:22 AM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 15 2023, 5:22 AM

Should have a MIR test. Also feels like it's more complicated than it needs to be to solve this

llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
1646

If SIFoldOperands worked like PeepholeOpt, you would have a map of these already. I'd rather avoid a linear scan backwards for every copy

1687

This can't happen

1689

I'm somewhat surprised we allow subregs on phi inputs but the verifier doesn't seem to check this

1759

Don't worry about setting kills, they're semi-deprecated and will be recomputed fully later anyway. Pre-regalloc you only need to worry about unsetting ones you may have had before

Pierre-vh added inline comments.Feb 15 2023, 11:55 PM
llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
1646

Do you mean that it's fine to store these in a map, or that we can't do that here?
I'd also like to avoid the backwards scan if possible

Pierre-vh updated this revision to Diff 497965.Feb 16 2023, 4:55 AM
Pierre-vh marked 3 inline comments as done.

The last big regression is now in test_mfma_loop_agpr_init on gfx908 where agpr-to-agpr copies are annoying.
I'm not sure yet how to fix that one, it seems to be because after PHIs are removed, we end up with copies like

%83.sub1:areg_1024 = COPY %83.sub0:areg_1024

all over the place so after RA/removing RA pseudos, we get a lot of code to copy sub0 to sub1-31.

Should have a MIR test. Also feels like it's more complicated than it needs to be to solve this

I will definitely write a MIR test (and remove the WIP tag) once this diff is in the right direction (no major changes needed). I just didn't write the test yet because I'm still not sure this is headed in the right direction and how many changes will be needed

Pierre-vh planned changes to this revision.Feb 17 2023, 2:18 AM
Pierre-vh updated this revision to Diff 498329.Feb 17 2023, 4:22 AM

Fix remaining cases, add MIR test

Note that the MIR test is just a basic test, most of the interesting testing is
done in D143731 with mfma-loop.ll

Pierre-vh retitled this revision from [WIP][AMDGPU] Fold more AGPR copies/PHIs in SIFoldOperands to [AMDGPU] Fold more AGPR copies/PHIs in SIFoldOperands.Feb 17 2023, 4:23 AM
Pierre-vh edited the summary of this revision. (Show Details)
Pierre-vh added reviewers: arsenm, foad, rampitec.
arsenm added inline comments.Feb 20 2023, 12:53 PM
llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
1646

I mean I don't like the current use iteration SIFoldOperands currently uses, and it would be better if it was rewritten to collect a map of known foldable defs as it walked. I think you can't simply introduce such a map without redoing the whole pass structure

1700

Direct class equality checks are usually the wrong thing to do. Something like isSubclassEq or constrain to compatible subclass. Don't think there's any practical difference in this case

1850

Don't see why you need to build this map/vector. You can just start inserting the instructions after all the phis as you process each one

Pierre-vh marked 2 inline comments as done.Feb 21 2023, 2:12 AM
Pierre-vh added inline comments.
llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
1700

Here I want to make sure the RC is identical because I can create new register/copies to that RC. Should I still replace equality checks with isSubclassEq?

1850

I only insert VGPR temps if there is more than one use across all PHIs, so I first collect all the PHIs then check the operands I collected. A worklist seems more appropriate, though I could maybe do it with users in one go without building the map but it seems like it would be more complicated

I just noticed that GCNPreRAOptimizations does try to do some fixups for the intermediate AGPR case, maybe worth looking into why that didn't work here

llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
1700

It doesn't really matter, I think this is only a theoretical problem

1842

Instruction comment like above?

Pierre-vh updated this revision to Diff 500755.Feb 27 2023, 6:13 AM
Pierre-vh marked 4 inline comments as done.

Comments

arsenm accepted this revision.Mar 14 2023, 3:23 PM

Not my favorite but this is for the DAG

This revision is now accepted and ready to land.Mar 14 2023, 3:23 PM
This revision was landed with ongoing or failed builds.Mar 28 2023, 12:33 AM
This revision was automatically updated to reflect the committed changes.