In the TensorFlow library we found that LLVM generates redundant memory accesses to a local array. The issue can also be demonstrated by the following test case:
#include <string.h>

constexpr int size = 4;

void f(int *a, int *b) {
  float tmp[size];
  for (int i = 0; i < size; i++)
    tmp[i] = a[i];
  memcpy(b, tmp, size * sizeof(int));
}
LLVM generates:
  movups   (%rdi), %xmm0
  cvtdq2ps %xmm0, %xmm0
  movaps   %xmm0, -24(%rsp)   // redundant store
  movaps   -24(%rsp), %xmm0   // redundant load
  movups   %xmm0, (%rsi)
  retq
The reason is that SROA cannot handle memory accesses with variable offsets inside a loop. After the loop is fully unrolled, every access to the array has a fixed offset, so SROA could then process them, but no SROA pass runs after loop unrolling. This patch adds an SROA pass after loop unrolling to handle this pattern.