
[AMDGPU] Remove the gfx10 VALU register destination cache model
Accepted · Public

Authored by foad on Mar 10 2020, 5:05 AM.

Details

Summary

It complicates the model and we don't yet have any evidence that it
improves benchmarks.

Diff Detail

Event Timeline

foad created this revision.Mar 10 2020, 5:05 AM

According to spec "code should be scheduled to reuse data from destination cache whenever possible to reduce RAM reads both for power and performance. The destination cache retires results into the VGPRs at a rate of 1 result per VGPR-bank per cycle."

I.e., from the spec's point of view some specific scheduling is needed here even though the impact is small. You may see no impact at all because GCNRegBankReassign fights most of the bank conflicts. It makes sense to remeasure with reassign disabled (-amdgpu-reassign-regs=0). I also think this modeling may be improved, in particular with respect to the even and odd halves of the cache, but not removed entirely.

foad added a comment.Mar 10 2020, 12:18 PM

According to spec "code should be scheduled to reuse data from destination cache whenever possible to reduce RAM reads both for power and performance. The destination cache retires results into the VGPRs at a rate of 1 result per VGPR-bank per cycle."

I.e., from the spec's point of view some specific scheduling is needed here even though the impact is small. You may see no impact at all because GCNRegBankReassign fights most of the bank conflicts. It makes sense to remeasure with reassign disabled (-amdgpu-reassign-regs=0). I also think this modeling may be improved, in particular with respect to the even and odd halves of the cache, but not removed entirely.

If HWRC is modelling a cache for VALU instructions writing VGPRs then why do we use it for non-VALU instructions like:

def : HWWriteRes<WriteSALU,          [HWSALU,   HWRC], 5>;
def : HWWriteRes<WriteSMEM,          [HWLGKM,   HWRC], 20>;

?
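
If the destination cache really only applies to VALU results, a narrower model would drop HWRC from the non-VALU entries. A hypothetical sketch of what that might look like (the Write32Bit entry and all latency values here are illustrative, not taken from the actual SISchedule.td):

```tablegen
// Hypothetical sketch: keep the destination-cache resource on VALU writes
// only, and drop it from the scalar and memory writes quoted above.
// Latencies are illustrative.
def : HWWriteRes<Write32Bit,         [HWVALU,   HWRC], 5>;  // VALU result goes through the cache
def : HWWriteRes<WriteSALU,          [HWSALU],         5>;  // no VGPR destination cache
def : HWWriteRes<WriteSMEM,          [HWLGKM],         20>; // no VGPR destination cache
```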

According to spec "code should be scheduled to reuse data from destination cache whenever possible to reduce RAM reads both for power and performance. The destination cache retires results into the VGPRs at a rate of 1 result per VGPR-bank per cycle."

I.e., from the spec's point of view some specific scheduling is needed here even though the impact is small. You may see no impact at all because GCNRegBankReassign fights most of the bank conflicts. It makes sense to remeasure with reassign disabled (-amdgpu-reassign-regs=0). I also think this modeling may be improved, in particular with respect to the even and odd halves of the cache, but not removed entirely.

If HWRC is modelling a cache for VALU instructions writing VGPRs then why do we use it for non-VALU instructions like:

def : HWWriteRes<WriteSALU,          [HWSALU,   HWRC], 5>;
def : HWWriteRes<WriteSMEM,          [HWLGKM,   HWRC], 20>;

?

The same exists for SGPRs.

The test changes are massive in number, but they look pretty benign and rather point at weaknesses elsewhere.

There are three purposes to the destination cache:

  1. Provide some slack if write-back to the register file is contested.
  2. Re-use recently computed results to reduce register read bank conflicts.
  3. Re-use recently computed results to save power.

We should probably just ignore point 1, and by far the main impact should be point 3. Either way, the moral of it is that if we can schedule dependent instructions to land *exactly* after the latency of the instructions that they depend on, then that's a win -- but it's a weak one and should be deprioritized relative to almost all other concerns.
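
In scheduling-model terms, "landing exactly after the latency" is already something the scheduler can reason about through the Latency field of each WriteRes; the destination cache would only add a weak preference for consuming a result in exactly that cycle rather than later. A sketch of the relevant mechanism (the latency value is illustrative, not the real one):

```tablegen
// The scheduler knows when a result becomes available via Latency;
// the destination cache only sweetens consuming it in that exact cycle.
// The value 6 is illustrative, not taken from the real model.
def : HWWriteRes<Write64Bit, [HWVALU], 6>;
```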

I find the mixing of VALU and other instruction types dubious here as well, because even if they do have similar mechanisms (and, other than SALU, they really don't), they are *separate* mechanisms and would presumably have to be modeled via separate resources.

The bigger problem is that I don't see how the existing modeling via HWRC models what we want in the first place, and the test changes aren't particularly encouraging from the point of view of keeping it. If there isn't a really good argument, I'd agree with Jay that erring on the side of simplicity is the right thing to do here.

llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.update.dpp.ll
65–68

This is a regression because the distance between dependent instructions is reduced too much. But before worrying about that, we need to worry about what looks like an unnecessary move from SGPR.

llvm/test/CodeGen/AMDGPU/idot2.ll
112–115

This is an improvement because it eliminates a stall.

260–262

These should be scalar instructions in the first place, but the change eliminates a stall.

llvm/test/CodeGen/AMDGPU/llvm.amdgcn.image.sample.a16.dim.ll
789–794

This change just doesn't matter.

llvm/test/CodeGen/AMDGPU/shrink-add-sub-constant.ll
467–470

The fact that the computations of v2 and v3 are moved earlier is a regression, but I would consider it unrelated. It rather points to the fact that the scheduler doesn't understand just how ridiculously long the latency of VMEM instructions is, and that adding those extra VALUs between it and the computation of its address doesn't actually help at all.

nhaehnle accepted this revision.Mar 11 2020, 4:01 AM

I'm going to say preliminarily that this change LGTM, but give it some time in case a better argument for keeping the HWRC comes up.

This revision is now accepted and ready to land.Mar 11 2020, 4:01 AM
foad marked an inline comment as done.Mar 11 2020, 4:03 AM

Either way, the moral of it is that if we can schedule dependent instructions to land *exactly* after the latency of the instructions that they depend on, then that's a win -- but it's a weak one and should be deprioritized relative to almost all other concerns.

My gut feeling is that in most cases the scheduler would already like to use results ASAP, to reduce overall latency. Teaching it that the advantage of using a result ASAP instead of one cycle later is more than just saving one cycle would be an incredibly subtle change.

The bigger problem is that I don't see how the existing modeling via HWRC models what we want in the first place

Exactly. It defines a resource that is used for 1 cycle by every register write, but I don't understand how that is supposed to improve the scheduling.
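
For reference, the resource being discussed is declared in the gfx10 model roughly like this (a sketch from memory; the exact unit count and comment may differ in the real SISchedule.td):

```tablegen
// Sketch of the gfx10 model: a single destination-cache resource that every
// register write occupies for one cycle, regardless of which unit produced
// the result -- which is why it never creates a meaningful scheduling choice.
def HWRC : ProcResource<1>; // register destination cache
```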

llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.update.dpp.ll
65–68

It's hard for the (pre-RA) scheduler to do a good job with copies/moves because it doesn't know what the register allocator will do with them. So I think this is probably just luck.

Can we run perf measurements with and without this change, *and* with -amdgpu-reassign-regs=0 applied in both cases? If the difference is noise then I do not object to removing it.