This is a follow-up to D57056.
On Jaguar, we need to account for an additional operand latency of 6cy (caused by bypass delays) in the case of scalar_int-to-float conversions.
The latency of (V)CVTSI2S(S|D) should be f+3, where f is a bypass delay of 6cy (see AMD fam16h SOG).
This patch marks the input GPR operand as ReadIntToFpu, so that we correctly account for that delay. That quantity has then been subtracted from the opcode latency (which should just be 3cy).
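The split between opcode latency and bypass delay can be sketched as follows. This is only an illustrative model, not code from the patch: the constant and function names are made up for the example, and it assumes that the 6cy delay applies only to the GPR input while the XMM pass-through operand sees the plain 3cy opcode latency.

```python
# Illustrative model only; these names are hypothetical, not LLVM identifiers.
BYPASS_DELAY_INT_TO_FPU = 6  # "f" in the description above, in cycles
CVTSI2S_OPCODE_LATENCY = 3   # opcode latency once the delay is subtracted


def operand_to_result_latency(operand_is_gpr: bool) -> int:
    """Cycles from an input operand becoming ready to the result being ready."""
    # Assumption: only the GPR input pays the int-to-FPU bypass delay.
    delay = BYPASS_DELAY_INT_TO_FPU if operand_is_gpr else 0
    return CVTSI2S_OPCODE_LATENCY + delay


gpr_latency = operand_to_result_latency(True)   # f + 3
xmm_latency = operand_to_result_latency(False)  # XMM pass-through operand
print(gpr_latency, xmm_latency)
```

Under these assumptions a dependency through the GPR input costs f+3 = 9cy, while a dependency through the XMM operand costs only the 3cy opcode latency.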
I verified that latency/throughput numbers from llvm-mca have improved, and now they better match what is reported by perf on Jaguar. That being said, I still see cases where the IPC as reported by llvm-mca doesn't quite match the IPC from perf.
Example:
vcvtsi2ss %ecx, %xmm0, %xmm0 # Should tend to IPC: 0.33. Perf reports IPC: 0.25 (one cvt every 4cy).
I suspect that local forwarding might be disabled for it; it looks like users have to wait for an extra 1cy. That would explain the 0.25. For now I decided to go with what is in the documents, so we always assume a 3cy latency.
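The two IPC figures follow from simple arithmetic on the XMM dependency chain. The sketch below is a hypothetical helper, not part of llvm-mca. It assumes that each conversion in the example depends on the previous one through %xmm0, so steady-state IPC is the reciprocal of the chain latency.

```python
def chain_ipc(latency_cycles: int, extra_forwarding_cycles: int = 0) -> float:
    """Steady-state IPC of a chain of back-to-back dependent instructions."""
    # One instruction retires per (latency + extra wait) cycles.
    return 1.0 / (latency_cycles + extra_forwarding_cycles)


documented_ipc = chain_ipc(3)   # 3cy latency, as in the documents
suspected_ipc = chain_ipc(3, 1)  # 3cy plus the suspected extra 1cy wait
print(round(documented_ipc, 2), suspected_ipc)
```

With a 3cy latency the chain tends to 1/3 (about 0.33); with one extra cycle it becomes 1/4 = 0.25, which is what perf reports.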
Latency for the RM variants has changed (it has slightly improved). However, we need another patch to fix the number of micro opcodes (it should be 1, not 2).
One minor comment - we should probably add this to the avx512f equivalents for consistency.