Replace static disabling of renaming of instructions arguments
(to avoid overloading the constant bus) with dynamic clearing of
renamable flags. This allows Machine Copy Propagation to remove
more copies, in particular on GFX10 where the constant bus is
wider.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
Do you have any stats on how many more copies get propagated, and how many instructions that actually saves in the final ISA? I have a feeling that SIFoldOperands currently catches a lot of the cases that MachineCopyProp misses.
Where/why is renameable set in the first place? Can we just avoid setting it to begin with?
Looks like it was originally done to avoid passes like MachineCopyPropagation from introducing constant bus violations: rG1d531013876c02b18df678a5f67d6a7d94e392b9
The actual flags are set by Virtual Register Rewriter, true for most things.
The patch Jay referenced causes the renamable flags to be statically disabled for specific opcodes (i.e. all operands).
llvm/lib/Target/AMDGPU/SIFixRenamableFlags.cpp | ||
---|---|---|
96 | Same as RI.isVectorRegister(). | |
98 | Looks like you can get the same result with a Set/SetVector. | |
130 | RI.isVectorRegister(). | |
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp | ||
3599 | This does not seem to be correct? I.e. s[0:1] and s1 overlap, but count as two constant bus operands. I think even s[0:1] and s0 are separate operands. I see that it is copied from below, but seems to be an error anyway? |
The numbers for this change are not vastly compelling.
I looked at 11598 game shaders and compiled these for GFX7, GFX9 and GFX10.
On GFX7, 1 shader lost 1 instruction.
On GFX9, 1 shader lost 1 instruction, but 64 shaders gained 1 instruction.
On GFX10, 1 shader lost 1 instruction, but 2 shaders gained 1 instruction.
I started this change as I am looking at moving WQM after MI scheduling, and this potentially leaves some additional copies around.
But I could addressed these with very limited special case copy elimination in the WQM pass itself.
Does anyone have an opinion on whether I should continue pushing this?
Are there any other significant changes? I'm thinking of things like this that would help with dependency stalls on gfx10:
v_mov v0, 0 v_mov v1, v0
->
v_mov v0, 0 v_mov v1, 0
I started this change as I am looking at moving WQM after MI scheduling, and this potentially leaves some additional copies around.
But I could addressed these with very limited special case copy elimination in the WQM pass itself.Does anyone have an opinion on whether I should continue pushing this?
Honestly it does not sound like a good justification for a new pass.
Agreed, but maybe it's worth revisiting after the WQM pass has been moved.
I will review shader changes; however, I do not think Machine Copy Propagation propagates constants.
I will review shader changes; however, I do not think Machine Copy Propagation propagates constants.
True. But this can also help with stalls:
v_mov v0, s0 v_mov v1, v0
->
v_mov v0, s0 v_mov v1, s0
I looked over all the GFX10 diffs and could not find anything useful happening except a handful of rewrites of v_cndmask to use an additional SGPR argument; however, in practice none these would affect stalls. This is not too surprising v_cndmask was the made target of this when I wrote it, but for the specific code generation use case (moving WQM) I will take care of these at code generation time instead.
clang-format suggested style edits found: