I noticed this missed optimization in the CGP (CodeGenPrepare) memcmp() expansion, and then saw that we don't have the fold in InstCombine either.
It wasn't immediately clear to me that a vector bswap swaps the bytes of each element in the vector while leaving the elements in place. Should I add a blurb about that in the LangRef?
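
To make the per-element behavior concrete, here is a small Python model of my reading of the semantics. It is only a sketch, not LLVM code: the helper name `bswap_vector` and the sample values are made up for illustration, and elements are assumed to be unsigned integers of a fixed byte width.

```python
def bswap_vector(elements, width_bytes):
    """Model a vector bswap: reverse the bytes inside each element,
    leaving the order of the elements themselves unchanged."""
    return [
        # Serialize the element little-endian, then reinterpret those bytes
        # big-endian, which reverses the byte order within the element.
        int.from_bytes(e.to_bytes(width_bytes, "little"), "big")
        for e in elements
    ]


# Hypothetical <2 x i32> input value.
v = [0x11223344, 0xAABBCCDD]
swapped = bswap_vector(v, 4)

# Each lane is byte-swapped, but lane 0 stays lane 0 and lane 1 stays lane 1.
print([hex(x) for x in swapped])
```

Under this model the result is not the same as reversing all 8 bytes of the vector as a single wide integer, since that would also exchange the two lanes.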