I have seen cases in which MaxInterleaveFactor 4 gives better performance than MaxInterleaveFactor 2.
Let's see a simple example.
    #include <string.h>

    void test(char *dstPtr, const char *srcPtr, char *dstEnd) {
      do {
        memcpy(dstPtr, srcPtr, 8);
        dstPtr += 8;
        srcPtr += 8;
      } while (dstPtr < dstEnd);
    }
The InstCombine pass converts the memcpy into a load and a store because the length is 8 bytes.
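As a rough illustration of that lowering, the loop behaves like the C sketch below, with one 8-byte load and one 8-byte store per iteration. The names `test_lowered` and `tmp` are hypothetical, and the real transformation happens on LLVM IR rather than on C source, so this only models the effect.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical C-level model of the loop after the 8-byte memcpy has been
 * turned into a single load and a single store. The fixed-size memcpy calls
 * through `tmp` are the portable way to spell an unaligned 64-bit access. */
void test_lowered(char *dstPtr, const char *srcPtr, char *dstEnd) {
  do {
    uint64_t tmp;
    memcpy(&tmp, srcPtr, sizeof tmp);  /* models the 64-bit load  */
    memcpy(dstPtr, &tmp, sizeof tmp);  /* models the 64-bit store */
    dstPtr += 8;
    srcPtr += 8;
  } while (dstPtr < dstEnd);
}
```

With the body reduced to a plain load/store pair, the loop vectorizer can treat it like an ordinary element-wise copy.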
The vectorized assembly output for MaxInterleaveFactor 2 and 4 is shown below.
MaxInterleaveFactor 2

    .LBB0_7:                                // %vector.body
                                            // =>This Inner Loop Header: Depth=1
            ldp     q0, q1, [x13, #-16]
            add     x13, x13, #32
            subs    x14, x14, #4
            stp     q0, q1, [x12, #-16]
            add     x12, x12, #32
            b.ne    .LBB0_7

MaxInterleaveFactor 4

    .LBB0_7:                                // %vector.body
                                            // =>This Inner Loop Header: Depth=1
            ldp     q0, q1, [x12, #-32]
            subs    x14, x14, #8
            ldp     q2, q3, [x12], #64
            stp     q0, q1, [x13, #-32]
            stp     q2, q3, [x13], #64
            b.ne    .LBB0_7
Both loop bodies contain six instructions, so the MaxInterleaveFactor 4 output could ideally handle twice as much data per iteration as the MaxInterleaveFactor 2 output.
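To spell out the arithmetic behind that estimate, here is a small C sketch. The helper names are hypothetical; the inputs are read off the assembly above (each q register is 16 bytes, the factor 2 loop moves q0 and q1 per iteration, the factor 4 loop moves q0 through q3). It only counts instructions and bytes and says nothing about actual latency or throughput on a given core.

```c
/* One AArch64 q register is 128 bits wide. */
enum { Q_REG_BYTES = 16 };

/* Bytes copied by one loop iteration that loads and stores `q_regs`
 * q registers (hypothetical helper for illustration). */
unsigned bytes_per_iteration(unsigned q_regs) {
  return q_regs * Q_REG_BYTES;
}

/* Bytes copied per instruction, given the instruction count of the loop
 * body. This is an idealized figure that ignores the microarchitecture. */
double bytes_per_instruction(unsigned q_regs, unsigned insts) {
  return (double)bytes_per_iteration(q_regs) / (double)insts;
}
```

For the loops above, `bytes_per_iteration(2)` is 32 and `bytes_per_iteration(4)` is 64, and since both bodies have six instructions, the bytes-per-instruction figure doubles as well.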