BSWAP of vector types is implemented quite efficiently using vector shuffles on SSE/AVX targets; we should reflect its typical cost to encourage vectorization.
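To illustrate why a vector BSWAP is cheap, here is a minimal, portable sketch that models a 128-bit register as 16 bytes and performs the byte swap of four 32-bit lanes with one byte shuffle. This is not LLVM code: `Bytes16`, `shuffleBytes`, `makeBswapMask` and `bswapV4I32` are hypothetical names used only for illustration. The assumption is that a target with a byte-shuffle instruction could do the `shuffleBytes` step in a single operation with a constant mask.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical model of a 128-bit vector register as 16 raw bytes.
using Bytes16 = std::array<std::uint8_t, 16>;

// Generic byte shuffle: out[i] = src[mask[i]].
// Assumption: a hardware byte shuffle would do this in one instruction.
Bytes16 shuffleBytes(const Bytes16 &src, const Bytes16 &mask) {
  Bytes16 out{};
  for (std::size_t i = 0; i < out.size(); ++i)
    out[i] = src[mask[i] & 0x0F];
  return out;
}

// Build the constant mask that reverses the bytes inside each element.
// elemBytes is the element width in bytes (2, 4 or 8).
Bytes16 makeBswapMask(std::size_t elemBytes) {
  Bytes16 mask{};
  for (std::size_t i = 0; i < mask.size(); ++i) {
    std::size_t base = (i / elemBytes) * elemBytes;   // first byte of this element
    std::size_t offset = elemBytes - 1 - (i % elemBytes); // mirrored position
    mask[i] = static_cast<std::uint8_t>(base + offset);
  }
  return mask;
}

// Scalar reference used to check the shuffle-based version.
std::uint32_t bswap32(std::uint32_t v) {
  return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
         ((v << 8) & 0x00FF0000u) | (v << 24);
}

// BSWAP of four 32-bit lanes expressed as a single byte shuffle.
std::array<std::uint32_t, 4> bswapV4I32(const std::array<std::uint32_t, 4> &v) {
  Bytes16 raw{};
  std::memcpy(raw.data(), v.data(), raw.size());
  Bytes16 swapped = shuffleBytes(raw, makeBswapMask(4));
  std::array<std::uint32_t, 4> out{};
  std::memcpy(out.data(), swapped.data(), swapped.size());
  return out;
}
```

Calling `bswapV4I32` on four lanes should give the same result as applying `bswap32` to each lane. Since the mask is a compile-time constant, the whole operation amounts to one shuffle per vector, which is the cost the model would ideally report.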
Also, we're not making much use of the intrinsic costings on any target. For instance, why don't we use them for CTPOP instead of the rather limited getPopcntSupport() approach? CTLZ/CTTZ would probably benefit as well.