This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Aggressively fold immediates in SIShrinkInstructions
ClosedPublic

Authored by foad on Nov 26 2021, 7:56 AM.

Details

Summary

Fold immediates regardless of how many uses they have. This is expected
to increase overall code size, but decrease register usage.

Diff Detail

Event Timeline

foad created this revision.Nov 26 2021, 7:56 AM
foad requested review of this revision.Nov 26 2021, 7:56 AM
Herald added a project: Restricted Project. · View Herald TranscriptNov 26 2021, 7:56 AM
foad added a comment.Nov 26 2021, 8:05 AM

Some data on the comined effect of D114643 + D114644, from statically compiling a corpus of 10320 graphics shaders for gfx1010:

  • Total number of instructions decreased from 6071567 to 5999110 (-1.2%)
  • Total number of code bytes increased from 35932468 to 36174540 (+0.67%)
  • Total number of vgprs used decreased from 517395 to 517238 (-0.030%)
  • Total number of sgprs used decreased from 811411 to 805549 (-0.73%)
foad added inline comments.Nov 26 2021, 8:09 AM
llvm/test/CodeGen/AMDGPU/madmk.ll
34–36

Not a regression, but it's a bit sad that we don't form madmk here either before or after this patch.

foad updated this revision to Diff 412372.Mar 2 2022, 3:40 AM

Rebase.

Herald added a project: Restricted Project. · View Herald TranscriptMar 2 2022, 3:40 AM
foad updated this revision to Diff 429736.May 16 2022, 8:55 AM

Rebase.

Some data on the comined effect of D114643 + D114644, from statically compiling a corpus of 10320 graphics shaders for gfx1010:

  • Total number of instructions decreased from 6071567 to 5999110 (-1.2%)
  • Total number of code bytes increased from 35932468 to 36174540 (+0.67%)
  • Total number of vgprs used decreased from 517395 to 517238 (-0.030%)
  • Total number of sgprs used decreased from 811411 to 805549 (-0.73%)

Redoing this analysis, for gfx900:

  • Total number of instructions decreased from 5839766 to 5790517 (-0.84%)
  • Total number of code bytes increased from 30480844 to 30727840 (+0.81%)
  • Total number of readlane/writelane instructions decreased from 64049 to 62081 (-3.07%)
  • Total number of vgprs used increased from 479581 to 479702 (+0.03%)
  • Total number of sgprs used decreased from 766214 to 760162 (-0.79%)

For gfx1030:

  • Total number of instructions decreased from 6070932 to 6006155 (-1.07%)
  • Total number of code bytes increased from 31346752 to 31645184 (+0.95%)
  • Total number of readlane/writelane instructions decreased from 58297 to 56368 (-3.31%)
  • Total number of vgprs used decreased from 558964 to 558482 (-0.09%)
  • Total number of sgprs used decreased from 805633 to 800257 (-0.67%)
  • Total number of v_cmpx instructions increased from 26303 to 26579 (+1.05%)

The number of readlane/writelane instructions is an indication of how often sgprs get spilled into vgprs.

The reason for the increase in v_cmpx matching is that sometimes we get sequences like this:

s_mov_b32 s26, 0x3b23d70a
...
v_cmp_ngt_f32_e32 vcc_lo, s26, v17
s_and_saveexec_b32 s26, vcc_lo

This can't be converted to use v_cmpx because the uses of s26 would overlap:

s_mov_b32 s26, 0x3b23d70a
...
s_mov_b32 s26, exec_lo // clobbers s26 !!!
v_cmpx_ngt_f32_e32 s26, v17

But with the constant folded into the v_cmp instruction, it is fine:

s_mov_b32 s26, exec_lo
v_cmpx_ngt_f32_e32 0x3b23d70a, v17
foad added a comment.May 17 2022, 3:24 AM

I'd like to go ahead with D114643 + D114644 on the grounds that instruction count, register pressure and register dependencies are much more important for performance than code size in bytes. Does anyone have concerns about this or objections?

piotr added a comment.May 17 2022, 3:30 AM

I am strongly in favour of this change. On our target, the sgpr improvements that manifest themselves in reduced sgpr spilling outweighs the code size increase.

I am strongly in favour of this change. On our target, the sgpr improvements that manifest themselves in reduced sgpr spilling outweighs the code size increase.

+1. The change is LGTM. Let's wait for others to speak up.

arsenm accepted this revision.May 17 2022, 12:14 PM
arsenm added inline comments.
llvm/lib/Target/AMDGPU/SIShrinkInstructions.cpp
119

Do we have a test where constant folding happens leaving behind debug info?

This revision is now accepted and ready to land.May 17 2022, 12:14 PM
foad added inline comments.May 18 2022, 2:53 AM
llvm/lib/Target/AMDGPU/SIShrinkInstructions.cpp
119

I don't know. I get the impression that most of the code that ignores debug uses in our backend is just speculative, and never gets exercised in anger.

This revision was landed with ongoing or failed builds.May 18 2022, 3:04 AM
This revision was automatically updated to reflect the committed changes.