This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Increase MaxInterleaveFactor to 4
Needs Review · Public

Authored by jaykang10 on Mar 16 2022, 3:58 AM.

Details

Summary

I have seen cases in which MaxInterleaveFactor 4 gives better performance than MaxInterleaveFactor 2.
Let's look at a simple example.

void test(char *dstPtr, const char *srcPtr, char *dstEnd) {
  do { 
    memcpy(dstPtr, srcPtr, 8);  
    dstPtr += 8;
    srcPtr += 8;
  } while (dstPtr < dstEnd);
}

The InstCombine pass converts the memcpy into a load and a store because the length is a constant 8 bytes.
The vectorized assembly output with MaxInterleaveFactor 2 and with MaxInterleaveFactor 4 is shown below.

MaxInterleaveFactor 2
.LBB0_7:                                // %vector.body
                                        // =>This Inner Loop Header: Depth=1
        ldp     q0, q1, [x13, #-16]
        add     x13, x13, #32  
        subs    x14, x14, #4
        stp     q0, q1, [x12, #-16]
        add     x12, x12, #32  
        b.ne    .LBB0_7

MaxInterleaveFactor 4
.LBB0_7:                                // %vector.body
                                        // =>This Inner Loop Header: Depth=1
        ldp     q0, q1, [x12, #-32]
        subs    x14, x14, #8
        ldp     q2, q3, [x12], #64  
        stp     q0, q1, [x13, #-32]
        stp     q2, q3, [x13], #64  
        b.ne    .LBB0_7

Given that both loop bodies contain the same number of instructions (six), the MaxInterleaveFactor 4 version could ideally handle twice as much data per iteration (64 bytes versus 32) as the MaxInterleaveFactor 2 version.

Diff Detail

Event Timeline

jaykang10 created this revision. · Mar 16 2022, 3:58 AM
Herald added a project: Restricted Project. · View Herald Transcript · Mar 16 2022, 3:58 AM
jaykang10 requested review of this revision. · Mar 16 2022, 3:58 AM

Do you have any performance results for changing this? I've not had much luck with trying it in the past, and it obviously can change quite a lot. It can certainly help in places, but I've found that if you turn it up too high you just end up over-unrolling loops, not getting into the fast loop body as much. It can obviously depend on the input code and loop counts though. Perhaps it needs some better cost modelling?

I was still hoping to get D118979 in because it should help quite a bit - on its own it increases the number of items processed per vector element, and this will increase it further. We have cleaned up quite a few of the places it doesn't do as well; there are just a few that have been stuck in review a while. Perhaps it makes sense to try and push that through, then re-evaluate this on top?

Thanks for the comment, @dmgreen!

> Do you have any performance results for changing this? I've not had much luck with trying it in the past, and it obviously can change quite a lot. It can certainly help in places, but I've found that if you turn it up too high you just end up over-unrolling loops, not getting into the fast loop body as much. It can obviously depend on the input code and loop counts though. Perhaps it needs some better cost modelling?

I was able to see a slight overall performance improvement on an internal benchmark on neoverse-n1, but we would need to tune things like the cost model for each microarchitecture.

> I was still hoping to get D118979 in because it should help quite a bit - on its own it increases the number of items processed per vector element, and this will increase it further. We have cleaned up quite a few of the places it doesn't do as well; there are just a few that have been stuck in review a while. Perhaps it makes sense to try and push that through, then re-evaluate this on top?

I agree with you. Let's revisit this patch later.

Matt added a subscriber: Matt. · May 23 2022, 8:40 AM