Always keep wmma instructions with a 16-bit floating-point accumulator
as two-address instructions.

This is a prerequisite for an upcoming optimization for wmma with 16-bit
accumulator matrices.

We want to pack the results of two separate `wmma`s into the same
registers, so that one matrix is in the lower half while the other matrix
is in the upper half of the registers.

We pack the values into the registers before they are used as input to
the first `wmma`:

v_wmma_f16_16x16x16_f16 v[0:7], v[8:15], v[16:23], v[0:7]
v_wmma_f16_16x16x16_f16 v[0:7], v[24:31], v[32:39], v[0:7] op_sel:[0,0,1]

Therefore, both instructions need to write to the same registers and
overwrite the values of the input matrices.
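
The packing constraint can be illustrated with a small toy model. This is
only a sketch, not code from the backend: `write_half` and `packed_results`
are hypothetical helpers, each Python integer stands in for one 32-bit VGPR,
and it assumes that `op_sel:[0,0,1]` makes the second `wmma` write the upper
16-bit halves. Because each write replaces only one half, the destination
must already hold the other half, which is why it has to be the same
register as the accumulator input.

```python
# Toy model (assumption): one Python int represents one 32-bit VGPR that
# holds two 16-bit accumulator values, one per half.

def write_half(reg, value16, hi):
    """Return reg with one 16-bit half replaced and the other half preserved."""
    value16 &= 0xFFFF
    if hi:
        # Keep the lower half, replace the upper half.
        return (reg & 0x0000FFFF) | (value16 << 16)
    # Keep the upper half, replace the lower half.
    return (reg & 0xFFFF0000) | value16


def packed_results(first, second):
    """Pack the results of two hypothetical wmma ops into shared registers."""
    regs = [0] * len(first)
    # First wmma: results land in the lower halves.
    regs = [write_half(r, v, hi=False) for r, v in zip(regs, first)]
    # Second wmma (modelled as op_sel hi): results land in the upper halves
    # of the same registers, leaving the first matrix untouched.
    regs = [write_half(r, v, hi=True) for r, v in zip(regs, second)]
    return regs
```

If the second write targeted a different destination register, the lower
halves written by the first operation would not be carried along, and the
two matrices would no longer share storage.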

We have verified the correct behavior by running nod.ai's Stable
Diffusion with these changes in data layout.

On average, this change reduced the VGPR count by 17.17% (in the 88
shaders that the change applied to).