I'm not entirely sure this is generally a good idea, given X86's poor
variable shuffle support in earlier SSE versions.
These intrinsics are very undertested. I can improve that, but first
wanted feedback on whether this generally makes sense, or whether we need
a way to limit it to cases that have good variable shift support.