This patch fixes the performance problem reported in https://llvm.org/bugs/show_bug.cgi?id=23510.
Many X86 scalar instructions support a memory operand as the destination, but most vector instructions do not. Consider the following case in SLP cost evaluation:
scalar version:
t1 = load [mem1]; t1 = shift 5, t1; store t1, [mem1]
...
t4 = load [mem4]; t4 = shift 5, t4; store t4, [mem4]
slp vectorized version:
v1 = vload [mem]; v1 = vshift 5, v1; store v1, [mem]
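The SLP cost model estimates a saving of 12 - 3 = 9 instructions. However, on x86 the scalar version can be folded into the following form, where each load/shift/store sequence becomes a single instruction with a memory destination, while the vectorized version cannot: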
[mem1] = shift 5, [mem1]
[mem2] = shift 5, [mem2]
[mem3] = shift 5, [mem3]
[mem4] = shift 5, [mem4]
To handle this case, we add an extra cost of VL * 2 to the SLP cost evaluation, where VL is the vector length.
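
As a rough illustration of the arithmetic, the sketch below recomputes the estimated savings with and without the extra cost. The function names and the flat cost of one unit per instruction are assumptions made for illustration and are not the actual SLP vectorizer code. Reading VL * 2 as the load and store folded away in each scalar lane is likewise an interpretation of the description.

```cpp
// Hypothetical helpers for illustration only; not LLVM's cost-model API.
// Assumes every instruction costs one unit.

// Each scalar lane is modeled as load + shift + store.
const int OpsPerChain = 3;

// Savings as first estimated: VL scalar chains versus one vector chain.
int naiveSavings(int VL) {
  int ScalarCost = VL * OpsPerChain; // e.g. 4 * 3 = 12
  int VectorCost = OpsPerChain;      // vload + vshift + store = 3
  return ScalarCost - VectorCost;
}

// Savings after charging the extra VL * 2 to the vectorized form,
// which models the load and store that x86 can fold into each scalar shift.
int adjustedSavings(int VL) {
  int ExtraCost = VL * 2;
  return naiveSavings(VL) - ExtraCost;
}
```

For VL = 4, naiveSavings gives 12 - 3 = 9, while adjustedSavings gives 12 - (3 + 8) = 1, which matches comparing the 4 folded scalar instructions against the 3 vector instructions.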