This patch is based on discussion on the llvmdev mailing list:
http://lists.cs.uiuc.edu/pipermail/llvmdev/2015-July/087405.html
and also solves:
https://llvm.org/bugs/show_bug.cgi?id=17170
As mentioned on the dev list and in the bug report, the new loop over vector register sizes may cause an unacceptable compile-time increase, so it may need to be gated behind a more aggressive optimization setting. If the compile-time cost turns out to be acceptable, this patch could be extended to the other SLP pattern matchers that hardcode the vector register size (see FIXME comments).
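For readers following along, here is a rough sketch of the kind of loop being discussed. It is not the code from this patch: the function name `candidateVFs` and its parameters are hypothetical, and it only lists the vectorization factors that a halving loop over register sizes would try for a given element width.

```cpp
#include <vector>

// Hypothetical illustration, not the patch itself: list the vectorization
// factors (VFs) tried when looping from the widest vector register size
// down to the narrowest, halving the size on each iteration.
std::vector<unsigned> candidateVFs(unsigned MaxVecRegSize,
                                   unsigned MinVecRegSize,
                                   unsigned ElemBits) {
  std::vector<unsigned> VFs;
  if (ElemBits == 0)
    return VFs; // avoid dividing by zero on a degenerate element width

  // Stop once the register is narrower than the minimum size, or once it
  // can no longer hold at least two elements.
  for (unsigned Size = MaxVecRegSize;
       Size >= MinVecRegSize && Size >= 2 * ElemBits; Size /= 2)
    VFs.push_back(Size / ElemBits);

  return VFs;
}
```

With 256-bit and 128-bit register sizes and 8-bit elements, this yields VF=32 and then VF=16. Each extra iteration is another vectorization attempt on the same chain, which is where the compile-time concern comes from.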
Should the AMDGPU XFAIL test be fixed, or removed if it is no longer valid?
Shouldn't we update this threshold too? Otherwise, we won't be able to vectorize with VF=32 (and AVX2 might need <32 x i8> vectors).
However, increasing this value *would* hurt compile time, so we need to be careful here.