I have seen cases in which MaxInterleaveFactor 4 performs better than MaxInterleaveFactor 2.
Let's look at a simple example.
void test(char *dstPtr, const char *srcPtr, char *dstEnd) {
  do {
    memcpy(dstPtr, srcPtr, 8);
    dstPtr += 8;
    srcPtr += 8;
  } while (dstPtr < dstEnd);
}

The InstCombine pass converts the memcpy into a load and a store because the length is a constant 8 bytes.
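As a rough source-level picture of that transformation, the loop can be thought of as copying one 64-bit value per iteration. This is only a sketch: the name test_as_load_store and the temporary tmp are hypothetical and not part of the patch, and the real rewrite happens on LLVM IR rather than on C source.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical illustration only: the same loop as test(), written so that
 * each 8-byte memcpy is visibly one 64-bit load followed by one 64-bit store.
 * Going through a uint64_t temporary keeps the C well-defined regardless of
 * the alignment of the pointers. */
void test_as_load_store(char *dstPtr, const char *srcPtr, char *dstEnd) {
  do {
    uint64_t tmp;
    memcpy(&tmp, srcPtr, sizeof tmp);  /* conceptually: one 64-bit load  */
    memcpy(dstPtr, &tmp, sizeof tmp);  /* conceptually: one 64-bit store */
    dstPtr += 8;
    srcPtr += 8;
  } while (dstPtr < dstEnd);
}
```

With the body reduced to a plain load/store pair, the loop vectorizer can widen and interleave it, which is what the listings that follow show.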
The vectorized assembly output for MaxInterleaveFactor 2 and 4 is shown below.
MaxInterleaveFactor 2
.LBB0_7: // %vector.body
// =>This Inner Loop Header: Depth=1
ldp q0, q1, [x13, #-16]
add x13, x13, #32
subs x14, x14, #4
stp q0, q1, [x12, #-16]
add x12, x12, #32
b.ne .LBB0_7
MaxInterleaveFactor 4
.LBB0_7: // %vector.body
// =>This Inner Loop Header: Depth=1
ldp q0, q1, [x12, #-32]
subs x14, x14, #8
ldp q2, q3, [x12], #64
stp q0, q1, [x13, #-32]
stp q2, q3, [x13], #64
b.ne .LBB0_7

Both loop bodies have six instructions, but the MaxInterleaveFactor 4 loop moves 64 bytes per iteration against 32 for MaxInterleaveFactor 2, so it could ideally handle twice as much data per iteration.
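The arithmetic behind that estimate can be written out as a small sketch. The constants are read off the two listings (pointer increments of #32 and #64, six instructions in each loop). The helper name ideal_speedup is hypothetical, and the model deliberately ignores memory bandwidth, latency and issue-width limits, so it gives an upper bound rather than a measurement.

```c
/* Values taken from the two vector.body listings. */
enum {
  BYTES_PER_ITER_IF2 = 32, /* ldp/stp of q0,q1: 2 x 16 bytes       */
  BYTES_PER_ITER_IF4 = 64, /* two ldp/stp pairs: 4 x 16 bytes      */
  INSNS_PER_ITER     = 6   /* both loops contain six instructions  */
};

/* Hypothetical helper: ratio of bytes moved per instruction between two
 * loops, assuming every instruction costs the same. */
int ideal_speedup(int bytes_a, int insns_a, int bytes_b, int insns_b) {
  /* (bytes_a / insns_a) / (bytes_b / insns_b), kept in integers */
  return (bytes_a * insns_b) / (bytes_b * insns_a);
}
```

Calling ideal_speedup(BYTES_PER_ITER_IF4, INSNS_PER_ITER, BYTES_PER_ITER_IF2, INSNS_PER_ITER) gives 2 under these assumptions.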