Page MenuHomePhabricator

[AMDGPU] Improve scheduling model for VOP3b instructions
AcceptedPublic

Authored by foad on Mar 10 2020, 5:05 AM.

Details

Summary

VOP3b instructions like v_addc_u32 write vcc (an sgpr) as well as a vgpr
result. The way this was modelled made the write to vcc take an extra
micro-op, which made the whole instruction take twice as long to issue,
which is inaccurate.

Fix this by introducing a new write class that doesn't consume any
resources.

Diff Detail

Event Timeline

foad created this revision.Mar 10 2020, 5:05 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 10 2020, 5:05 AM
arsenm accepted this revision.Mar 10 2020, 8:10 AM
This revision is now accepted and ready to land.Mar 10 2020, 8:10 AM
rampitec added inline comments.Mar 10 2020, 10:49 AM
llvm/lib/Target/AMDGPU/SISchedule.td
32

The name suggests it is any VCC write, which is not so. In addition it is not always a VCC. Maybe change to WriteAuxSGPR?

llvm/test/CodeGen/AMDGPU/atomic_optimizations_local_pointer.ll
989

What about AMDGPUMacroFusion which tries to do exactly the opposite?

foad marked 2 inline comments as done.Mar 11 2020, 3:20 AM
foad added inline comments.
llvm/lib/Target/AMDGPU/SISchedule.td
32

The name suggests it is any VCC write, which is not so.

Good point. I will try to come up with a better name.

In addition it is not always a VCC.

True but the name gives you a hint that it is usually VCC (i.e. we try to persuade the register allocator to use VCC, and if it does then we can use a smaller instruction encoding).

llvm/test/CodeGen/AMDGPU/atomic_optimizations_local_pointer.ll
989

Why do you say "the opposite"? Macro fusion tries to put the v_add next to the v_addc (but apparently it fails in this case). My patch should not stop this from working.

rampitec added inline comments.Mar 11 2020, 11:16 AM
llvm/test/CodeGen/AMDGPU/atomic_optimizations_local_pointer.ll
989

Aren't you adding a latency between vcc def and its use?

foad marked 3 inline comments as done.Mar 11 2020, 11:35 AM
foad added inline comments.
llvm/test/CodeGen/AMDGPU/atomic_optimizations_local_pointer.ll
989

No I'm just replacing WriteSALU with WriteVCC which has the same latency. But macro fusion overrides this anyway and forces the latency to 0 for any dependencies between the instructions that it fuses.

Mostly LGTM, except for the name of the resource.

llvm/test/CodeGen/AMDGPU/atomic_optimizations_local_pointer.ll
989

OK, makes sense.

foad marked 2 inline comments as done.Mar 12 2020, 9:31 AM
foad added inline comments.
llvm/lib/Target/AMDGPU/SISchedule.td
32

How about Write32BitAux ? The fact that it is writing an SGPR is not really important, all that matters is the latency, which should be the same as for the first def operand.

So v_addc would be [Write32Bit, Write32BitAux], v_div_scale_f64 would be [WriteDouble, WriteDoubleAux] and so on.

rampitec added inline comments.Mar 12 2020, 10:58 AM
llvm/lib/Target/AMDGPU/SISchedule.td
32

I think you do not need 2 resources, both 32 bit and 64 bit write the same mask. Maybe just WriteAux?

foad updated this revision to Diff 250189.Mar 13 2020, 6:24 AM

Switch to using multiple "Aux" sched write classes.

foad added a comment.Mar 13 2020, 6:26 AM

I think you do not need 2 resources, both 32 bit and 64 bit write the same mask. Maybe just WriteAux?

I want the write to VCC to have the same latency as the main result of the instruction. That's why I used multiple different "Aux" classes as in the updated patch.

foad updated this revision to Diff 333002.Mar 24 2021, 8:20 AM

Rebase without D75909.