Currently we model i16 bswap as very high cost (10),
which doesn't seem right, with all the others being at 1.
Regardless of MOVBE, i16 reg-reg bswap is lowered into
(an extending move plus) rot-by-8:
https://godbolt.org/z/8jrq7fMTj
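
To illustrate why a rotate by 8 is enough for the 16-bit case, here is a minimal C++ sketch. The function names `bswap16_via_rot` and `check_bswap16_via_rot` are hypothetical and used only for illustration; this is not the lowering code itself, just the arithmetic identity it relies on.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: byte-swap a 16-bit value by rotating it by 8 bits.
// For a 16-bit quantity, a rotate by 8 (in either direction) exchanges
// the low and high bytes, which is exactly what bswap does.
constexpr std::uint16_t bswap16_via_rot(std::uint16_t x) {
    return static_cast<std::uint16_t>((x << 8) | (x >> 8));
}

// Compile-time checks of the identity.
static_assert(bswap16_via_rot(0x1234) == 0x3412, "rot-by-8 swaps the two bytes");
static_assert(bswap16_via_rot(bswap16_via_rot(0xBEEF)) == 0xBEEF,
              "applying the swap twice restores the input");

// Hypothetical runtime check covering every 16-bit value:
// swapping twice must always give back the original value.
inline void check_bswap16_via_rot() {
    for (std::uint32_t v = 0; v <= 0xFFFF; ++v) {
        const std::uint16_t x = static_cast<std::uint16_t>(v);
        assert(bswap16_via_rot(bswap16_via_rot(x)) == x);
    }
}
```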
I think it should at worst have a throughput cost of 1.
Since i32/i64 already have a cost of 1,
MOVBE doesn't improve their costs any further.
However, MOVBE must have exactly one memory operand, with the other being a register.
This means that if we have a bswap of a load, and the load has a single use, we'll fold the bswap into the load.
Likewise, if we have a store of a bswap, and the bswap has a single use, we'll fold the bswap into the store.
So I think we should treat such a bswap as free.
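
As a rough source-level picture of the two patterns described above, the following C++ sketch shows a bswap whose only operand is a load, and a store whose only operand is a bswap. The function names `load_bswap32` and `store_bswap32` are hypothetical. `__builtin_bswap32` is the GCC/Clang builtin, so this assumes one of those compilers. Whether a given build actually selects MOVBE depends on the target features enabled, so no particular assembly output is claimed here.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Pattern 1: bswap of a load.
// The loaded value has a single use (the bswap), which is the case
// the description says can be folded into one load-with-swap.
std::uint32_t load_bswap32(const void *p) {
    std::uint32_t v;
    std::memcpy(&v, p, sizeof v);   // plain 32-bit load
    return __builtin_bswap32(v);    // only user of the loaded value
}

// Pattern 2: store of a bswap.
// The swapped value has a single use (the store), which is the case
// the description says can be folded into one store-with-swap.
void store_bswap32(void *p, std::uint32_t v) {
    const std::uint32_t swapped = __builtin_bswap32(v);  // only used by the store
    std::memcpy(p, &swapped, sizeof swapped);            // plain 32-bit store
}
```

If the loaded value or the swapped value had a second user, the swap would have to be materialized in a register anyway, which is why the single-use condition matters for treating the bswap as free.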
You say 'prefer' - is the future intention to alter isel depending on this? I'm not sure that's actually useful.