Always keep wmma instructions with a 16-bit floating-point accumulator
as two-address instructions.
This is a prerequisite for an upcoming optimization for wmma with 16-bit
accumulators.
We want to pack the results of two separate
wmmas into the same register, so one matrix
is in the lower half while the other matrix is
in the upper half of the registers.
We pack the values into the registers before using them
in the first wmma as input:
  v_wmma_f16_16x16x16_f16 v[0:7], v[8:15], v[16:23], v[0:7]
  v_wmma_f16_16x16x16_f16 v[0:7], v[24:31], v[32:39], v[0:7] op_sel:[0,0,1]
Therefore, both instructions need to write to the same registers
and overwrite the values of the input matrices.
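The following is a toy model of that packing, not the backend code. It
assumes, as the op_sel:[0,0,1] example suggests, that one wmma writes
its 16-bit results to the lower half of each 32-bit register and the
other to the upper half. The helper names (write_half, wmma_writeback)
and the result values are hypothetical. The point it illustrates is that
a write to one half must preserve the other half, so the destination is
also an input.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF


def write_half(reg: int, value: int, upper: bool) -> int:
    """Write a 16-bit value into one half of a 32-bit register.

    The other half is read from the old register contents and kept,
    which is why the destination must also be treated as a source.
    """
    value &= MASK16
    if upper:
        return (reg & MASK16) | (value << 16)
    return (reg & ~MASK16 & MASK32) | value


def wmma_writeback(regs, results, upper):
    """Model one wmma writing its per-lane results into regs (hypothetical)."""
    return [write_half(r, v, upper) for r, v in zip(regs, results)]


# Eight registers, matching the v[0:7] destination in the example.
regs = [0] * 8
first = [0x1000 + i for i in range(8)]   # made-up results of the first wmma
second = [0x2000 + i for i in range(8)]  # made-up results of the second wmma

regs = wmma_writeback(regs, first, upper=False)  # lower halves
regs = wmma_writeback(regs, second, upper=True)  # upper halves, op_sel-style
```

If the second write used a fresh destination instead of the same
registers, the lower halves written by the first wmma would be lost.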
We have verified the correct behavior by
running nod.ai's Stable Diffusion with these
changes in data layout.
On average, this change reduced the VGPR count by 17.17% across the
88 shaders that it applied to.