Generalize tryFoldLCSSAPhi into tryFoldPhiAGPR which works
on any kind of PHI node (not just LCSSA ones) and attempts to
create AGPR Phis more aggressively.
Also adds a GFX908-only "cleanup" function tryOptimizeAGPRPhis
which tries to minimize AGPR to AGPR copies on GFX908, which doesn't
have a ACCVGPR MOV instruction (so AGPR-AGPR copies become 2 or 3 instructions
as they need a VGPR temp). The reason why this is needed is because D143731
+ the new tryFoldPhiAGPR may create a lot more PHIs (one 32xfloat PHI becomes
32 float phis), and if each PHI hits the same AGPR (like in test_mfma_loop_agpr_init)
they will be lowered to 32 copies from the same AGPR, which will each become 2-3 instructions.
Creating a VGPR cache in this case prevents all those copies from being generated
(we have AGPR-VGPR copies instead which are trivial).
This is a prepation patch intended to prevent regressions in D143731 when
AGPRs are involved.
If SIFoldOperands worked like PeepholeOpt, you would have a map of these already. I'd rather avoid a linear scan backwards for every copy