This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Less aggressively break large PHIs
ClosedPublic

Authored by Pierre-vh on Apr 7 2023, 5:55 AM.

Details

Summary

In some cases, breaking large PHIs can very negatively affect
performance (3x more instructions observed in a particular test case).

This patch adds some basic profitability heuristics to help with some of these issues without affecting the "good" cases.
e.g. avoid breaking PHIs if it causes back-and-forth between vector/scalar form for no good reason.

Fixes SWDEV-392803
Fixes SWDEV-393781
Fixes SWDEV-394228

Diff Detail

Event Timeline

Pierre-vh created this revision.Apr 7 2023, 5:55 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 7 2023, 5:55 AM
Pierre-vh requested review of this revision.Apr 7 2023, 5:55 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 7 2023, 5:55 AM
arsenm added inline comments.Apr 7 2023, 3:35 PM
llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
1412–1431

I'm worried this heuristic is too simple and doesn't really recognize canonical IR. If I run instcombine on the test cases, nearly all of them fold out the insertelement chains

Pierre-vh added inline comments.Apr 11 2023, 12:35 AM
llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
1412–1431

We don't run InstCombine after CGP, but we could. I tried it and my sample is even smaller with this:
(I haven't checked if it fixed the original issue though, but it should)

  • Trunk w/ amdgpu-codegenprepare-break-large-phis=0: 16314 instructions
  • This patch: 16310 instructions
  • Trunk: 40310
  • Trunk + InstCombiner run after CGP: 13057

If running IC after CGP is okay with you I can create a patch for it.

Pierre-vh added inline comments.Apr 11 2023, 12:53 AM
llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
1412–1431

Ah, I tried to run IC after CGP but it causes about 300 failures. Is there a way to run IC logic on the added instructions only perhaps?

Pierre-vh added inline comments.Apr 11 2023, 1:25 AM
llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
1412–1431

Forgot to mention: I don't try to anticipate what instcombine will fold but rather what the DAG will fold.
There's no IC run after CGP, so this tries to only break PHIs when it thinks the DAG can simplify-away the added instructions

Pierre-vh updated this revision to Diff 513196.Apr 13 2023, 5:43 AM
  • Run the test through IC as well to ensure it doesn't break the heuristic
  • Handle insertelement constant operand
Pierre-vh edited the summary of this revision. (Show Details)Apr 13 2023, 9:09 AM

Use (mostly) canonical IR in the tests + add more test cases

Pierre-vh updated this revision to Diff 513532.Apr 14 2023, 4:06 AM

Clean-up comments

arsenm accepted this revision.Apr 14 2023, 4:44 AM

I suspect the extract logic might not work so well for 64-bit elements

This revision is now accepted and ready to land.Apr 14 2023, 4:44 AM

I suspect the extract logic might not work so well for 64-bit elements

What do you mean? Is there a test case you would like me to add in the meantime?
FWIW I didn't observe any regressions with this (only bugfixes and improvements), and most of the test cases had vectors of double

This revision was landed with ongoing or failed builds.Apr 14 2023, 6:41 AM
This revision was automatically updated to reflect the committed changes.