This patch generalizes the lowering of shuffles as zero extensions to allow extensions that don't start from the first element. It now recognises extensions starting anywhere in the lower 128 bits or at the start of any higher 128-bit lane.
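To illustrate the kind of mask matching described above, here is a minimal standalone sketch. It is not the code from the patch: the function name `matchZeroExtendOffset`, the sentinels `kUndef` and `kZero`, and the `eltsPerLane` parameter are all hypothetical, and the sketch only considers a single-input shuffle mask.

```cpp
#include <cassert>
#include <optional>
#include <vector>

// Hypothetical sentinels for this sketch only.
constexpr int kUndef = -1; // element may be anything
constexpr int kZero = -2;  // element is known to be zero

// Returns the source offset if `mask` looks like a zero extension by `scale`:
// output element i*scale reads input element offset+i, and every other output
// element is zero or undef. The offset must lie in the lowest 128-bit lane or
// at the start of a higher lane. Returns std::nullopt otherwise.
std::optional<int> matchZeroExtendOffset(const std::vector<int> &mask,
                                         int scale, int eltsPerLane) {
  assert(scale > 1 && eltsPerLane > 0);
  std::optional<int> offset;
  const int n = static_cast<int>(mask.size());
  for (int i = 0; i < n; ++i) {
    const int m = mask[i];
    if (m == kUndef)
      continue; // undef is compatible with anything
    if (i % scale != 0) {
      // Padding slot: must be zero.
      if (m != kZero)
        return std::nullopt;
      continue;
    }
    // Data slot: must read a real input element (this sketch rejects zero).
    if (m < 0)
      return std::nullopt;
    const int candidate = m - i / scale;
    if (candidate < 0 || (offset && *offset != candidate))
      return std::nullopt;
    offset = candidate;
  }
  if (!offset)
    return std::nullopt; // no data slot pinned down an offset
  // Accept any offset in the lowest lane, or the first element of a higher lane.
  if (*offset >= eltsPerLane && *offset % eltsPerLane != 0)
    return std::nullopt;
  return offset;
}
```

With 16-bit elements (`eltsPerLane` of 8) and `scale` of 2, a mask such as `{4, kZero, 5, kZero, 6, kZero, 7, kZero}` would match with offset 4, whereas an offset of 9 would be rejected because it is neither in the lowest lane nor at the start of a higher one.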
The motivation was to reduce the number of high-cost PSHUFB instructions, but it also improves the SSE2 case.
This is definitely a borderline case - without the early-out, most of the time we are just replacing an XOR (zero) + PUNPCKH with a PSHUFD + PMOVZX. There's next to nothing in it, so I went with avoiding a change.