This is, hopefully, the final patch needed to resolve PR21711 (http://llvm.org/bugs/show_bug.cgi?id=21711).
The 'f3' test case in that report presents a situation where we have two 128-bit stores extracted from a 256-bit source vector. Instead of producing this:
vmovaps %xmm0, (%rdi)
vextractf128 $1, %ymm0, 16(%rdi)
this patch merges the two 128-bit stores into a single unaligned 256-bit store:
vmovups %ymm0, (%rdi)
To minimize changes, this patch handles only one pattern: extract_subvector nodes feeding into stores. I've also included test cases for the other store-merge patterns that could be handled similarly, namely build_vectors of constants and vector loads feeding into vector stores that could be widened.
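
For readers who want the merge condition spelled out, here is a minimal standalone sketch of the check for the extract_subvector case. It is not the code in this patch: the types SubvectorStore and WideStore, the function tryMergeSubvectorStores, and the idea of identifying the source by an integer id are all hypothetical names invented for illustration, and the sketch ignores details a real DAG combine would need to consider (chains, volatility, alignment, legality of the wide type).

```cpp
#include <cstdint>

// Hypothetical model of a store whose value is a subvector extracted from a
// wider source vector. These types are illustrative, not LLVM data structures.
struct SubvectorStore {
    int sourceId;            // identifies the wide source vector
    unsigned extractBit;     // bit offset of the extracted piece in the source
    unsigned widthBits;      // width of the extracted piece
    std::int64_t byteOffset; // store address as an offset from a common base
};

// Hypothetical result: one store of the whole source vector.
struct WideStore {
    int sourceId;
    unsigned widthBits;
    std::int64_t byteOffset;
};

// Returns true and fills *merged when `lo` and `hi` together write the whole
// source vector to consecutive memory, so one wide store can replace them.
bool tryMergeSubvectorStores(const SubvectorStore &lo, const SubvectorStore &hi,
                             unsigned sourceWidthBits, WideStore *merged) {
    // Both halves must come from the same source vector.
    if (lo.sourceId != hi.sourceId)
        return false;
    // The two pieces must be equal-sized halves covering the full source.
    if (lo.widthBits != hi.widthBits ||
        lo.widthBits + hi.widthBits != sourceWidthBits)
        return false;
    // The low half starts at bit 0; the high half starts right after it.
    if (lo.extractBit != 0 || hi.extractBit != lo.widthBits)
        return false;
    // The high half must be stored immediately after the low half in memory.
    if (hi.byteOffset != lo.byteOffset + lo.widthBits / 8)
        return false;

    *merged = WideStore{lo.sourceId, sourceWidthBits, lo.byteOffset};
    return true;
}
```

In terms of the 'f3' case, the low store would be modeled as {source, 0, 128, 0} and the high store as {source, 128, 128, 16}, which the sketch merges into one 256-bit store at offset 0.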