This patch is based on discussion on the llvmdev mailing list:
http://lists.cs.uiuc.edu/pipermail/llvmdev/2015-July/087405.html
and also solves:
https://llvm.org/bugs/show_bug.cgi?id=17170
As mentioned on the dev list and in the bug report, the new loop over vector register sizes may cause an unacceptable compile-time increase, so it may need to be guarded by a more aggressive optimization level. If not, this patch could be extended to the other SLP pattern matchers that hardcode the vector register size (see FIXME comments).
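To illustrate where the compile-time cost comes from, here is a minimal standalone sketch of a loop over vector register sizes. It is not the code in this patch: `tryVectorizeChain`, `pickVecRegSize`, and the simple "chain fills the register" check are hypothetical stand-ins for the real SLP cost-model query, and the halving step is an assumption about how the sizes are enumerated.

```cpp
// Hypothetical stand-in for the SLP vectorization attempt: succeeds when a
// chain of `chainLen` elements, each `elemBits` wide, can fill at least one
// register of `regBits` bits with two or more lanes.
static bool tryVectorizeChain(unsigned chainLen, unsigned elemBits,
                              unsigned regBits) {
  if (elemBits == 0)
    return false;
  unsigned lanes = regBits / elemBits;
  return lanes >= 2 && chainLen >= lanes;
}

// Try the widest register first, then halve the width until an attempt
// succeeds or the width drops below `minRegBits`. Returns the width that
// worked, or 0 if none did. Each extra iteration repeats the (potentially
// expensive) vectorization attempt, which is the compile-time concern.
unsigned pickVecRegSize(unsigned chainLen, unsigned elemBits,
                        unsigned maxRegBits, unsigned minRegBits) {
  for (unsigned regBits = maxRegBits;
       regBits != 0 && regBits >= minRegBits; regBits /= 2) {
    if (tryVectorizeChain(chainLen, elemBits, regBits))
      return regBits;
  }
  return 0;
}
```

With these assumptions, a chain of four 32-bit elements fails at 256 bits and succeeds at 128 bits, so the attempt runs twice instead of once. Guarding the loop behind a higher optimization level would bound that extra work.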
Should the AMDGPU XFAIL test be fixed, or removed if it is no longer valid?