Shuffle combining and lowerVectorShuffleAsLanePermuteAndBlend were both still trying to use VPERM2X128 for unary shuffles when AVX2 is enabled. VPERM2X128 takes two inputs, so when we use it for a unary shuffle, one of those inputs is left undefined. That creates a false dependency on whatever register gets allocated there.
If we have VPERMQ/VPERMPD, we should prefer those, since they take only a single input.
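
To illustrate why the single-input form is always available for a unary 128-bit lane shuffle, here is a rough standalone sketch of the two immediate encodings. The helper names are hypothetical and are not the functions in X86ISelLowering.cpp; LoLane and HiLane are assumed to be 0 or 1 and name the source lane that should end up in the low and high half of the result.

```cpp
#include <cstdint>

// Hypothetical helper: imm8 for VPERMQ/VPERMPD that moves whole 128-bit lanes
// of a single 256-bit source. Each 2-bit field of the immediate selects one
// 64-bit element, and lane L covers elements 2*L and 2*L+1.
constexpr uint8_t unaryLaneShuffleToVPERMQImm(unsigned LoLane,
                                              unsigned HiLane) {
  return static_cast<uint8_t>((2 * LoLane) | ((2 * LoLane + 1) << 2) |
                              ((2 * HiLane) << 4) |
                              ((2 * HiLane + 1) << 6));
}

// Hypothetical helper: imm8 for VPERM2F128/VPERM2I128 performing the same
// shuffle. Bits [1:0] pick the low result lane and bits [5:4] the high one;
// selector values 0 and 1 read the first source, so the second source operand
// is never used but still has to be assigned a register.
constexpr uint8_t unaryLaneShuffleToVPERM2X128Imm(unsigned LoLane,
                                                  unsigned HiLane) {
  return static_cast<uint8_t>(LoLane | (HiLane << 4));
}
```

For example, swapping the two lanes (LoLane = 1, HiLane = 0) gives 0x4E for VPERMQ and 0x01 for VPERM2X128. Both produce the same result, but only the VPERM2X128 form carries an unused second operand.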
What happens if we just replace this with a generic DAG.getVectorShuffle? I'm thinking ahead to when we finally get around to supporting 512-bit vectors in lowerVectorShuffleAsLanePermuteAndBlend.