In some cases, register copies between i16 and i32 are created, but there is no physical register copy implementation for them. This typically happens when a non-coherent load is emitted for 1- or 2-byte integer types. This patch adds support for such copies.
It is possible that this should instead be fixed earlier, before the copies are emitted, given that we get sequences like:
ld.global.nc.u8 %rs7, [%rd81];
cvt.u32.u16 %r56, %rs7;
which means we have an implicit cast from u8 to u16, which suggests that something may be wrong. Nevertheless, for the test cases I have access to, this fix works just fine.
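For reference, the value semantics of the two instructions above can be modeled in plain C++. This is only a sketch: the function names are hypothetical, it says nothing about how the backend implements the copy, and the narrowing direction (32-to-16) is assumed to be a plain truncation.

```cpp
#include <cstdint>

// Hypothetical model of `ld.global.nc.u8 %rs, [addr]`: an unsigned 8-bit
// load into a 16-bit register zero-extends the loaded byte.
inline std::uint16_t load_u8_into_r16(const std::uint8_t* addr) {
    return static_cast<std::uint16_t>(*addr);
}

// Hypothetical model of `cvt.u32.u16 %r, %rs`: widening an unsigned
// 16-bit value to 32 bits zero-extends, so the value is preserved.
inline std::uint32_t cvt_u32_u16(std::uint16_t rs) {
    return static_cast<std::uint32_t>(rs);
}

// Assumed model of the opposite direction (32-to-16): keep the low 16 bits.
inline std::uint16_t cvt_u16_u32(std::uint32_t r) {
    return static_cast<std::uint16_t>(r);
}
```

Under this model, a 16-to-32 copy never changes the value, while a 32-to-16 copy is lossless only when the upper 16 bits are already zero.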
Out of curiosity, which code pattern leads to a 16-to-32 copy? Traditionally, COPY only copies between registers of the same type. Later, I added int-to-float and float-to-int copies because bitcast needs them.