It complicates the model and we don't yet have any evidence that it helps.
According to the spec, "code should be scheduled to reuse data from destination cache whenever possible to reduce RAM reads both for power and performance. The destination cache retires results into the VGPRs at a rate of 1 result per VGPR-bank per cycle."
I.e., from the spec's point of view some specific scheduling is needed here, even though the impact is small. You may see no impact at all because GCNRegBankReassign fights most of the bank conflicts. It makes sense to remeasure with reassignment disabled (-amdgpu-reassign-regs=0). I also think this modeling can be improved, in particular with respect to the even and odd halves of the cache, but it should not be removed entirely.
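For intuition, here is a toy Python model (all constants invented, not the real hardware) of the retirement rule the spec describes: results retire from the destination cache at 1 per VGPR bank per cycle, so results targeting the same bank retire back to back while results spread across banks retire in parallel.

```python
# Toy model (numbers invented) of "1 result per VGPR-bank per cycle".

NUM_BANKS = 4  # assumption: purely illustrative bank count

def retire_time(writes):
    """writes: list of (produce_cycle, dest_vgpr) pairs. Returns the
    cycle by which every result has retired from the destination cache
    into the register file, given 1 retirement per bank per cycle."""
    bank_free = [0] * NUM_BANKS
    done = 0
    for cycle, vgpr in sorted(writes):
        bank = vgpr % NUM_BANKS
        # a result retires the cycle after it is produced, unless its
        # bank is still busy retiring an earlier result
        retire = max(cycle + 1, bank_free[bank] + 1)
        bank_free[bank] = retire
        done = max(done, retire)
    return done

# Two results ready in the same cycle, different banks: both retire
# immediately.
print(retire_time([(0, 0), (0, 1)]))  # -> 1
# Same bank: the second result waits an extra cycle in the cache.
print(retire_time([(0, 0), (0, 4)]))  # -> 2
```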
If HWRC is modelling a cache for VALU instructions writing VGPRs, then why do we use it for non-VALU instructions like:
def : HWWriteRes<WriteSALU, [HWSALU, HWRC], 5>;
def : HWWriteRes<WriteSMEM, [HWLGKM, HWRC], 20>;
The number of test changes is massive, but they look pretty benign and rather point at weaknesses elsewhere.
There are three purposes to the destination cache:
- Provide some slack if write-back to the register file is contested.
- Re-use recently computed results to reduce register read bank conflicts.
- Re-use recently computed results to save power.
We should probably just ignore point 1, and by far the main impact should be point 3. Either way, the moral is that if we can schedule dependent instructions to land *exactly* after the latency of the instructions that they depend on, then that's a win -- but it's a weak one and should be deprioritized relative to almost all other concerns.
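The win from points 2 and 3 can be sketched as a toy model (all constants invented for illustration): a consumer can reuse a producer's result from the destination cache only while the result is still resident there, i.e. for a short window after the producer's latency has elapsed, which is why landing exactly at the latency matters.

```python
# Toy model of the reuse window; CACHE_RESIDENCY is an assumption, not
# a documented hardware value.

CACHE_RESIDENCY = 2  # assumed: cycles a result stays in the dest cache

def read_hits_cache(producer_issue, producer_latency, consumer_issue):
    """True if the consumer's operand read can be served from the
    destination cache rather than the VGPR file (toy model)."""
    ready = producer_issue + producer_latency
    return ready <= consumer_issue < ready + CACHE_RESIDENCY

# Consumer lands exactly at the producer's latency: cache hit, so no
# register-file (RAM) read.
print(read_hits_cache(0, 4, 4))  # -> True
# A few cycles later the result has retired to the VGPRs: RAM read.
print(read_hits_cache(0, 4, 8))  # -> False
```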
I also find the mixing of VALU and other instruction types dubious here, because even if they do have similar mechanisms (and, other than SALU, they really don't), they are *separate* mechanisms and would presumably have to be modeled via separate resources.
The bigger problem is that I don't see how the existing modeling via HWRC models what we want in the first place, and the test changes aren't particularly encouraging from the point of view of keeping it. If there isn't a really good argument, I'd agree with Jay that erring on the side of simplicity is the right thing to do here.
This is a regression because the distance between dependent instructions is reduced too much. But before worrying about that, we need to worry about what looks like an unnecessary move from an SGPR.
This is an improvement because it eliminates a stall.
These should be scalar instructions in the first place, but the change eliminates a stall.
This change just doesn't matter.
The fact that the computations of v2 and v3 are moved earlier is a regression, but I would consider it unrelated. Rather, it points to the fact that the scheduler doesn't understand just how ridiculously long the latency of VMEM instructions is, and that adding those extra VALUs between a VMEM instruction and the computation of its address doesn't actually help at all.
My gut feeling is that in most cases the scheduler would already like to use results ASAP, to reduce overall latency. Teaching it that the advantage of using a result ASAP instead of 1 cycle later is more than just saving 1 cycle would be an incredibly subtle change.
The bigger problem is that I don't see how the existing modeling via HWRC models what we want in the first place
Exactly. It defines a resource that is used for 1 cycle by every register write, but I don't understand how that is supposed to improve the scheduling.
It's hard for the (pre-RA) scheduler to do a good job with copies/moves because it doesn't know what the register allocator will do with them. So I think this is probably just luck.