This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Dynamically clear renamable to avoid constant bus errors
AbandonedPublic

Authored by critson on Sep 13 2020, 12:27 AM.

Details

Summary

Replace static disabling of renaming of instructions arguments
(to avoid overloading the constant bus) with dynamic clearing of
renamable flags. This allows Machine Copy Propagation to remove
more copies, in particular on GFX10 where the constant bus is
wider.

Diff Detail

Event Timeline

critson created this revision.Sep 13 2020, 12:27 AM
critson requested review of this revision.Sep 13 2020, 12:27 AM
foad added a comment.Sep 14 2020, 1:16 AM

Do you have any stats on how many more copies get propagated, and how many instructions that actually saves in the final ISA? I have a feeling that SIFoldOperands currently catches a lot of the cases that MachineCopyProp misses.

Where/why is renameable set in the first place? Can we just avoid setting it to begin with?

foad added a comment.Sep 14 2020, 7:41 AM

Where/why is renameable set in the first place? Can we just avoid setting it to begin with?

Looks like it was originally done to avoid passes like MachineCopyPropagation from introducing constant bus violations: rG1d531013876c02b18df678a5f67d6a7d94e392b9

Where/why is renameable set in the first place? Can we just avoid setting it to begin with?

Looks like it was originally done to avoid passes like MachineCopyPropagation from introducing constant bus violations: rG1d531013876c02b18df678a5f67d6a7d94e392b9

The actual flags are set by Virtual Register Rewriter, true for most things.
The patch Jay referenced causes the renamable flags to be statically disabled for specific opcodes (i.e. all operands).

rampitec added inline comments.Sep 14 2020, 10:29 AM
llvm/lib/Target/AMDGPU/SIFixRenamableFlags.cpp
96

Same as RI.isVectorRegister().

98

Looks like you can get the same result with a Set/SetVector.

130

RI.isVectorRegister().

llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
3599

This does not seem to be correct? I.e. s[0:1] and s1 overlap, but count as two constant bus operands. I think even s[0:1] and s0 are separate operands. I see that it is copied from below, but seems to be an error anyway?

The numbers for this change are not vastly compelling.
I looked at 11598 game shaders and compiled these for GFX7, GFX9 and GFX10.
On GFX7, 1 shader lost 1 instruction.
On GFX9, 1 shader lost 1 instruction, but 64 shaders gained 1 instruction.
On GFX10, 1 shader lost 1 instruction, but 2 shaders gained 1 instruction.

I started this change as I am looking at moving WQM after MI scheduling, and this potentially leaves some additional copies around.
But I could addressed these with very limited special case copy elimination in the WQM pass itself.

Does anyone have an opinion on whether I should continue pushing this?

The numbers for this change are not vastly compelling.
I looked at 11598 game shaders and compiled these for GFX7, GFX9 and GFX10.
On GFX7, 1 shader lost 1 instruction.
On GFX9, 1 shader lost 1 instruction, but 64 shaders gained 1 instruction.
On GFX10, 1 shader lost 1 instruction, but 2 shaders gained 1 instruction.

I started this change as I am looking at moving WQM after MI scheduling, and this potentially leaves some additional copies around.
But I could addressed these with very limited special case copy elimination in the WQM pass itself.

Does anyone have an opinion on whether I should continue pushing this?

Honestly it does not sound like a good justification for a new pass.

critson abandoned this revision.Sep 16 2020, 1:21 AM

Follow up verifier fix for issue noted by Stas in D87748.

foad added a comment.Sep 16 2020, 1:31 AM

The numbers for this change are not vastly compelling.
I looked at 11598 game shaders and compiled these for GFX7, GFX9 and GFX10.
On GFX7, 1 shader lost 1 instruction.
On GFX9, 1 shader lost 1 instruction, but 64 shaders gained 1 instruction.
On GFX10, 1 shader lost 1 instruction, but 2 shaders gained 1 instruction.

Are there any other significant changes? I'm thinking of things like this that would help with dependency stalls on gfx10:

v_mov v0, 0
v_mov v1, v0

->

v_mov v0, 0
v_mov v1, 0

I started this change as I am looking at moving WQM after MI scheduling, and this potentially leaves some additional copies around.
But I could addressed these with very limited special case copy elimination in the WQM pass itself.

Does anyone have an opinion on whether I should continue pushing this?

Honestly it does not sound like a good justification for a new pass.

Agreed, but maybe it's worth revisiting after the WQM pass has been moved.

Are there any other significant changes? I'm thinking of things like this that would help with dependency stalls on gfx10:

v_mov v0, 0
v_mov v1, v0

->

v_mov v0, 0
v_mov v1, 0

I started this change as I am looking at moving WQM after MI scheduling, and this potentially leaves some additional copies around.
But I could addressed these with very limited special case copy elimination in the WQM pass itself.

Does anyone have an opinion on whether I should continue pushing this?

Honestly it does not sound like a good justification for a new pass.

Agreed, but maybe it's worth revisiting after the WQM pass has been moved.

I will review shader changes; however, I do not think Machine Copy Propagation propagates constants.

foad added a comment.Sep 16 2020, 1:52 AM

Are there any other significant changes? I'm thinking of things like this that would help with dependency stalls on gfx10:

v_mov v0, 0
v_mov v1, v0

->

v_mov v0, 0
v_mov v1, 0

I will review shader changes; however, I do not think Machine Copy Propagation propagates constants.

True. But this can also help with stalls:

v_mov v0, s0
v_mov v1, v0

->

v_mov v0, s0
v_mov v1, s0

I will review shader changes; however, I do not think Machine Copy Propagation propagates constants.

True. But this can also help with stalls:

v_mov v0, s0
v_mov v1, v0

->

v_mov v0, s0
v_mov v1, s0

I looked over all the GFX10 diffs and could not find anything useful happening except a handful of rewrites of v_cndmask to use an additional SGPR argument; however, in practice none these would affect stalls. This is not too surprising v_cndmask was the made target of this when I wrote it, but for the specific code generation use case (moving WQM) I will take care of these at code generation time instead.