[x86] use a single shufps for 256-bit vectors when it can save instructions
This is the 256-bit counterpart to the 128-bit transform checked in here:
This patch is based on the draft by @sroland (Roland Scheidegger) that is
attached to PR27885:
Looks great to me.