This is an archive of the discontinued LLVM Phabricator instance.

[NewPM] Add an SROA pass after loop unroll
ClosedPublic

Authored by Carrot on Oct 7 2019, 2:14 PM.

Details

Summary

In tensorflow library we found llvm generates redundant memory accesses to local array. It can also be demonstrated by following test case

#include <memory.h>

constexpr int size=4;

void f(int *a,int * b) {

float tmp[size];
for(int i =0;i<size;i++) {
    tmp[i] = a[i];
}
memcpy(b,tmp,size*sizeof(int));
return;

}

LLVM generates:

movups  (%rdi), %xmm0
cvtdq2ps        %xmm0, %xmm0
movaps  %xmm0, -24(%rsp)             // *
movaps  -24(%rsp), %xmm0             // *
movups  %xmm0, (%rsi)
retq

The reason is SROA can't handle memory accesses with variant offset inside a loop, after the loop is fully unrolled, all memory accesses to the array are with fixed offset, so now they can be processed by SROA. But there is no more SROA passes after loop unroll. This patch add an SROA pass after loop unroll to handle this pattern.

Diff Detail

Event Timeline

Carrot created this revision.Oct 7 2019, 2:14 PM
wmi added a subscriber: wmi.Oct 29 2019, 10:31 AM
wmi added a comment.Oct 29 2019, 10:39 AM

Looks like a very helpful patch. I saw the problem showing up at multiple places when I was looking at a halide testcase and I was wondering what was wrong. I can try it out and see if the patch can fix it.

Could you evaluate the compilation time impact of the patch? If there is measurable compilation time increase, you may consider only adding it to O3.

+ 1 for late SROA pass. Can you add this test case too?

You should answer obvious questions: any compile time change? performance change?

I tried to build spec2006int, the time changes from

real 5m54.943s
user 89m17.735s
sys 1m13.810s

to

real 6m0.082s
user 89m30.166s
sys 1m10.192s

So 0.2% difference.

Performance is still running.

wmi added a comment.Oct 29 2019, 12:09 PM
In D68593#1725683, @wmi wrote:

Looks like a very helpful patch. I saw the problem showing up at multiple places when I was looking at a halide testcase and I was wondering what was wrong. I can try it out and see if the patch can fix it.

I tried it and found it was a different issue in the test I looked at. Anyway, it still looks helpful.

arsenm added a subscriber: arsenm.Oct 29 2019, 12:11 PM

This is a problem I have noticed. This should probably add a testcase showing the benefit

Carrot updated this revision to Diff 227185.Oct 30 2019, 3:13 PM

Add a new test case.

Also tested with spec2006int, result changes from 39.0 to 39.2.

Nice perf improvement :)

This patch should be accepted.

This revision is now accepted and ready to land.Oct 31 2019, 6:09 AM
MaskRay accepted this revision.Oct 31 2019, 8:23 AM
MaskRay added a subscriber: MaskRay.

In the description, you can indent the whole code block by 2, then Phabricator will nicely present it.

This revision was automatically updated to reflect the committed changes.
phosek added a subscriber: phosek.Nov 2 2019, 3:35 PM

Our bots started failing after this change landed with the following error:

******************** TEST 'Clang :: CodeGenCXX/union-tbaa2.cpp' FAILED ********************
Script:
--
: 'RUN: at line 1';   /b/s/w/ir/k/recipe_cleanup/clangSkDsbg/llvm_build_dir/bin/clang -cc1 -internal-isystem /b/s/w/ir/k/recipe_cleanup/clangSkDsbg/llvm_build_dir/lib/clang/10.0.0/include -nostdsysteminc /b/s/w/ir/k/llvm-project/clang/test/CodeGenCXX/union-tbaa2.cpp -O2 -std=c++11 -triple x86_64-unknown-linux-gnu -target-cpu x86-64 -target-feature +sse4.2 -target-feature +avx -emit-llvm -o - | /b/s/w/ir/k/recipe_cleanup/clangSkDsbg/llvm_build_dir/bin/FileCheck /b/s/w/ir/k/llvm-project/clang/test/CodeGenCXX/union-tbaa2.cpp
--
Exit Code: 1

Command Output (stderr):
--
/b/s/w/ir/k/llvm-project/clang/test/CodeGenCXX/union-tbaa2.cpp:19:11: error: CHECK: expected string not found in input
// CHECK: store <4 x double>
          ^
<stdin>:1:1: note: scanning from here
; ModuleID = '/b/s/w/ir/k/llvm-project/clang/test/CodeGenCXX/union-tbaa2.cpp'
^
<stdin>:6:33: note: possible intended match here
@.str = private unnamed_addr constant [4 x i8] c"%f \00", align 1
                                ^

--

Our toolchain uses new PM by default which is presumably why it hasn't manifested on other bots. Can you quickly fix this or is it OK to revert the change?

Try to add

-fno-experimental-new-pass-manager

to that test.

I see no reason to revert this since upstream LLVM works.