Currently we model i16 bswap as very high cost (10),
which doesn't seem right, with all others being at 1.
Regardless of MOVBE, i16 reg-reg bswap is lowered into
(an extending move plus) rot-by-8:
https://godbolt.org/z/8jrq7fMTj
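
To illustrate why a rotate is enough here: swapping the two bytes of a 16-bit value is the same as rotating it by 8 bits. The helpers below are a standalone sketch, not LLVM code; `bswap16` and `rotl16` are hypothetical names introduced only for this example.

```cpp
#include <cstdint>

// Hypothetical helper: swap the two bytes of a 16-bit value.
constexpr std::uint16_t bswap16(std::uint16_t X) {
  return static_cast<std::uint16_t>(((X & 0x00FFu) << 8) |
                                    ((X & 0xFF00u) >> 8));
}

// Hypothetical helper: rotate a 16-bit value left by Amt bits.
constexpr std::uint16_t rotl16(std::uint16_t X, unsigned Amt) {
  Amt &= 15u; // keep the shift amount in range
  if (Amt == 0)
    return X;
  return static_cast<std::uint16_t>((static_cast<unsigned>(X) << Amt) |
                                    (static_cast<unsigned>(X) >> (16u - Amt)));
}

// For 16 bits, rotating by 8 exchanges the high and low bytes,
// which is exactly what a byte swap does.
static_assert(bswap16(0x1234) == 0x3412, "bytes are exchanged");
static_assert(rotl16(0x1234, 8) == bswap16(0x1234), "rot-by-8 == bswap");
```
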
I think it should at worst have throughput of 1.
Since i32/i64 already have cost of 1,
MOVBE doesn't improve their costs any further.
BUT, MOVBE must have at least a single memory operand, with the other being a register.
Which means, if we have a bswap of a load, iff the load has a single use, we'll fold the bswap into the load.
Likewise, if we have a store of a bswap, iff the bswap has a single use, we'll fold the bswap into the store.
So I think we should treat such a bswap as free.
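
The proposed rule can be sketched roughly as follows. This is a simplified standalone model, not the actual patch: `Operand`, `BSwapQuery`, and `bswapCost` are hypothetical names, and the sketch assumes the same rule applies to every scalar width.

```cpp
// Hypothetical, simplified stand-ins; these are not LLVM types.
enum class Operand { Register, SingleUseLoad, MultiUseLoad };

struct BSwapQuery {
  unsigned Bits;        // scalar width: 16, 32 or 64 (not used by this rule)
  Operand Source;       // where the value being swapped comes from
  bool OnlyUserIsStore; // the bswap's single use is a store
  bool HasMOVBE;        // target supports MOVBE
};

// Assumed cost rule: a bswap that folds into a MOVBE load or store is free,
// anything else costs 1 (including i16, which lowers to a rot-by-8).
constexpr unsigned bswapCost(const BSwapQuery &Q) {
  if (Q.HasMOVBE &&
      (Q.Source == Operand::SingleUseLoad || Q.OnlyUserIsStore))
    return 0;
  return 1;
}

// reg-reg i16 bswap: cost 1 regardless of MOVBE.
static_assert(bswapCost({16, Operand::Register, false, true}) == 1, "");
// bswap of a single-use load with MOVBE: folded, so free.
static_assert(bswapCost({32, Operand::SingleUseLoad, false, true}) == 0, "");
```

A load with multiple uses cannot be folded, so in this model it still costs 1 even when MOVBE is available.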
clang-format: please reformat the code
- static const CostTblEntry X64CostTbl[] = { // 64-bit targets
-   { ISD::ABS, MVT::i64, 2 }, // SUB+CMOV
-   { ISD::BITREVERSE, MVT::i64, 14 },
-   { ISD::BSWAP, MVT::i64, 1 },
-   { ISD::CTLZ, MVT::i64, 4 }, // BSR+XOR or BSR+XOR+CMOV
-   { ISD::CTTZ, MVT::i64, 3 }, // TEST+BSF+CMOV/BRANCH
-   { ISD::CTPOP, MVT::i64, 10 },
-   { ISD::SADDO, MVT::i64, 1 },
-   { ISD::UADDO, MVT::i64, 1 },
-   { ISD::UMULO, MVT::i64, 2 }, // mulq + seto
55 diff lines are omitted. See full path.