vperm2* intrinsics are just shuffles unless a zero mask bit is set. In a few special cases, they're not even shuffles.
Optimizing intrinsics in InstCombine is better than handling this in the front-end for at least two reasons:
- Optimizing custom-written SSE intrinsic code at -O0 makes vector coders really angry (and so I have some regrets about some of last week's patches).
- Mask-conversion logic in header files is hard to write and even harder to read.
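As an illustration of that mask conversion, here is a minimal standalone sketch (not the patch's actual code; `vperm2Mask` is a hypothetical helper) of how a vperm2f128/vperm2i128 immediate maps to shufflevector indices over the 8-element concatenation of two <4 x i64> sources, using -1 to mark lanes that the zero-mask bits force to zero:

```cpp
#include <array>
#include <cstdint>

// Hypothetical sketch: decode a vperm2* imm8 into a 4-element shuffle
// mask. Indices 0..3 address src1, 4..7 address src2; -1 marks an
// element that must be zeroed (which a plain shuffle can't express).
std::array<int, 4> vperm2Mask(uint8_t imm) {
  std::array<int, 4> mask{};
  for (int half = 0; half < 2; ++half) {
    unsigned sel = (imm >> (4 * half)) & 0x3; // 128-bit lane selector
    bool zero = (imm >> (4 * half)) & 0x8;    // zero-the-lane bit
    for (int i = 0; i < 2; ++i)
      mask[2 * half + i] = zero ? -1 : int(2 * sel + i);
  }
  return mask;
}
```

For example, imm8 = 0x31 (low half from src1's high lane, high half from src2's high lane) decodes to the mask {2, 3, 6, 7}.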
Unfortunately, we use a magic number (generally assumed to be -1) to specify undef values in shufflevector masks in IR. And apparently, that magic has led to lax coding where we just check for a value < 0 to mean undef. If we had a proper enum for shufflevector mask special values, we could do what the x86 backend has done and easily transform the zero-mask-bit cases here too. Fixing that could be a follow-on patch. Otherwise, we'll try to match the resulting 2-shuffle sequence in the x86 backend. But again, that's a separate patch (see the TODO comment in this one).
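For reference, the zero-mask-bit cases could in principle be expressed today as a shuffle against an all-zero second operand rather than via a mask enum. A hypothetical sketch (`zeroLaneMask` is not real code in this patch), again for <4 x i64>:

```cpp
#include <array>

// Hypothetical sketch: model a zeroed 128-bit lane as a shuffle of
// <src, zeroinitializer>. Indices 0..3 pick from src; indices 4..7
// pick from the all-zero vector, so the zeroed lanes stay in-bounds
// instead of relying on the <0 "undef" magic.
std::array<int, 4> zeroLaneMask(bool zeroLow, bool zeroHigh) {
  std::array<int, 4> mask = {0, 1, 2, 3}; // identity: all from src
  if (zeroLow)  { mask[0] = 4; mask[1] = 5; }
  if (zeroHigh) { mask[2] = 6; mask[3] = 7; }
  return mask;
}
```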