Fold immediates regardless of how many uses they have. This is expected
to increase overall code size, but decrease register usage.
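As a rough illustration of the trade-off (the register numbers and the constant here are made up, not taken from the patch):

```
// before: the literal is materialized once and kept in an sgpr
s_mov_b32 s0, 0x3e22f983
v_mul_f32_e32 v1, s0, v1
v_mul_f32_e32 v2, s0, v2

// after: each use carries its own 32-bit literal (bigger encoding),
// but s0 and the s_mov_b32 are freed (lower sgpr pressure)
v_mul_f32_e32 v1, 0x3e22f983, v1
v_mul_f32_e32 v2, 0x3e22f983, v2
```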
Diff Detail
- Repository: rG LLVM Github Monorepo
Event Timeline
Some data on the combined effect of D114643 + D114644, from statically compiling a corpus of 10320 graphics shaders for gfx1010:
- Total number of instructions decreased from 6071567 to 5999110 (-1.2%)
- Total number of code bytes increased from 35932468 to 36174540 (+0.67%)
- Total number of vgprs used decreased from 517395 to 517238 (-0.030%)
- Total number of sgprs used decreased from 811411 to 805549 (-0.73%)
llvm/test/CodeGen/AMDGPU/madmk.ll:34–35
Not a regression, but it's a bit sad that we don't form madmk here either before or after this patch.
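For readers unfamiliar with the instruction: v_madmk_f32 is the encoding that folds a 32-bit literal multiplicand directly into a multiply-add. A made-up example:

```
// v0 = v1 * 10.0 + v2, with the literal 0x41200000 (10.0f) folded in
v_madmk_f32 v0, v1, 0x41200000, v2
```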
Redoing this analysis for gfx900:
- Total number of instructions decreased from 5839766 to 5790517 (-0.84%)
- Total number of code bytes increased from 30480844 to 30727840 (+0.81%)
- Total number of readlane/writelane instructions decreased from 64049 to 62081 (-3.07%)
- Total number of vgprs used increased from 479581 to 479702 (+0.03%)
- Total number of sgprs used decreased from 766214 to 760162 (-0.79%)
For gfx1030:
- Total number of instructions decreased from 6070932 to 6006155 (-1.07%)
- Total number of code bytes increased from 31346752 to 31645184 (+0.95%)
- Total number of readlane/writelane instructions decreased from 58297 to 56368 (-3.31%)
- Total number of vgprs used decreased from 558964 to 558482 (-0.09%)
- Total number of sgprs used decreased from 805633 to 800257 (-0.67%)
- Total number of v_cmpx instructions increased from 26303 to 26579 (+1.05%)
The number of readlane/writelane instructions is an indication of how often sgprs get spilled into vgprs.
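For context, an sgpr spill is typically lowered to lane writes/reads of a carrier vgpr, roughly like this (register and lane numbers invented):

```
// spill: write s5 into lane 3 of the spill-carrier vgpr
v_writelane_b32 v40, s5, 3
...
// reload: read it back from the same lane
v_readlane_b32 s5, v40, 3
```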
The reason for the increase in v_cmpx matching is that sometimes we get sequences like this:
```
s_mov_b32 s26, 0x3b23d70a
...
v_cmp_ngt_f32_e32 vcc_lo, s26, v17
s_and_saveexec_b32 s26, vcc_lo
```
This can't be converted to use v_cmpx because the uses of s26 would overlap:
```
s_mov_b32 s26, 0x3b23d70a
...
s_mov_b32 s26, exec_lo // clobbers s26 !!!
v_cmpx_ngt_f32_e32 s26, v17
```
But with the constant folded into the v_cmp instruction, it is fine:
```
s_mov_b32 s26, exec_lo
v_cmpx_ngt_f32_e32 0x3b23d70a, v17
```
I am strongly in favour of this change. On our target, the sgpr improvements that manifest themselves in reduced sgpr spilling outweigh the code size increase.
llvm/lib/Target/AMDGPU/SIShrinkInstructions.cpp:95
Do we have a test where constant folding happens leaving behind debug info?
llvm/lib/Target/AMDGPU/SIShrinkInstructions.cpp:95
> Do we have a test where constant folding happens leaving behind debug info?

I don't know. I get the impression that most of the code that ignores debug uses in our backend is just speculative, and never gets exercised in anger.
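For context, the shape of the concern, sketched in MIR (names, values and metadata are invented, not taken from an actual test): after the immediate is folded into the arithmetic use, a DBG_VALUE that also used the register is left behind and has to be updated or dropped rather than blocking the fold:

```
; before folding (hypothetical):
%5:sreg_32 = S_MOV_B32 1078530011
DBG_VALUE %5, $noreg, !"x", !DIExpression()
%6:vgpr_32 = V_ADD_F32_e32 %5, %4, implicit $mode, implicit $exec

; after folding, the debug use remains; it can be rewritten to refer
; to the constant directly instead of the now-dead register:
DBG_VALUE i32 1078530011, $noreg, !"x", !DIExpression()
%6:vgpr_32 = V_ADD_F32_e32 1078530011, %4, implicit $mode, implicit $exec
```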