This is an archive of the discontinued LLVM Phabricator instance.

[NewPM] Add an SROA pass after loop unroll
Closed Public

Authored by Carrot on Oct 7 2019, 2:14 PM.

Details

Reviewers

chandlerc
tejohnson
MaskRay

Commits

rGcecc0d27ad58: [NewPM] Add an SROA pass after loop unroll

Summary

In the TensorFlow library we found that LLVM generates redundant memory accesses to a local array. It can also be demonstrated by the following test case:

  #include <memory.h>

  constexpr int size = 4;

  void f(int *a, int *b) {
    float tmp[size];
    for (int i = 0; i < size; i++) {
      tmp[i] = a[i];
    }
    memcpy(b, tmp, size * sizeof(int));
    return;
  }

LLVM generates:

movups  (%rdi), %xmm0
cvtdq2ps        %xmm0, %xmm0
movaps  %xmm0, -24(%rsp)             // *
movaps  -24(%rsp), %xmm0             // *
movups  %xmm0, (%rsi)
retq

The reason is that SROA can't handle memory accesses with a variable offset inside a loop. After the loop is fully unrolled, all memory accesses to the array use fixed offsets, so SROA can process them. But no SROA pass runs after loop unroll. This patch adds an SROA pass after loop unroll to handle this pattern.
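To make the fixed-offset argument concrete, here is a source-level sketch of the test case before and after full unrolling. The names f_loop, f_unrolled and check_equivalent are made up for illustration, and real unrolling happens on LLVM IR rather than on the source, so treat this as an analogy and not as the compiler's actual output.

```cpp
#include <cassert>
#include <cstring>

constexpr int size = 4;
// Assumption of this sketch: int and float have the same size (true on x86-64).
static_assert(sizeof(float) == sizeof(int), "sketch assumes equal-sized int and float");

// Original shape: tmp is indexed by the loop variable, so the offset into
// the array is not a compile-time constant while the loop still exists.
void f_loop(int *a, int *b) {
  float tmp[size];
  for (int i = 0; i < size; i++)
    tmp[i] = a[i];
  std::memcpy(b, tmp, size * sizeof(int));
}

// Hand-unrolled equivalent (hypothetical): every access to tmp now uses a
// constant index, which is the form a scalar-replacement pass can split
// into independent values.
void f_unrolled(int *a, int *b) {
  float tmp[size];
  tmp[0] = a[0];
  tmp[1] = a[1];
  tmp[2] = a[2];
  tmp[3] = a[3];
  std::memcpy(b, tmp, size * sizeof(int));
}

// Sanity check: both shapes write the same bytes to b.
void check_equivalent() {
  int a[size] = {1, 2, 3, 4};
  int b1[size];
  int b2[size];
  f_loop(a, b1);
  f_unrolled(a, b2);
  assert(std::memcmp(b1, b2, sizeof(b1)) == 0);
}
```

Both functions compute the same result; only the second exposes constant offsets into tmp, which is the property the summary relies on.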

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

Carrot created this revision. Oct 7 2019, 2:14 PM

Herald added a project: Restricted Project. Oct 7 2019, 2:14 PM

Herald added subscribers: llvm-commits, dexonsmith, steven_wu and 2 others.

ping

Carrot added a reviewer: tejohnson. Oct 29 2019, 9:58 AM

wmi added a subscriber: wmi. Oct 29 2019, 10:31 AM

Looks like a very helpful patch. I saw the problem showing up at multiple places when I was looking at a halide testcase and I was wondering what was wrong. I can try it out and see if the patch can fix it.

Could you evaluate the compilation time impact of the patch? If there is measurable compilation time increase, you may consider only adding it to O3.

+1 for a late SROA pass. Can you add this test case too?

You should answer obvious questions: any compile time change? performance change?

I tried building spec2006int; the build time changes from

real 5m54.943s
user 89m17.735s
sys 1m13.810s

to

real 6m0.082s
user 89m30.166s
sys 1m10.192s

So about a 0.2% difference in user time.
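As a quick check of that percentage, the sketch below converts the two sets of timings into seconds and computes the relative change. The helper names are made up for illustration; the figures are copied from the two runs above.

```cpp
#include <cassert>
#include <cstdio>

// Convert a `time`-style "XmY.Zs" reading to seconds.
double to_seconds(int minutes, double seconds) {
  return minutes * 60.0 + seconds;
}

// Relative change from `before` to `after`, in percent.
double percent_change(double before, double after) {
  assert(before > 0.0);  // a zero baseline has no meaningful relative change
  return (after - before) / before * 100.0;
}

// Print the relative change for the user and real (wall-clock) rows.
void print_build_time_deltas() {
  const double user =
      percent_change(to_seconds(89, 17.735), to_seconds(89, 30.166));
  const double real =
      percent_change(to_seconds(5, 54.943), to_seconds(6, 0.082));
  std::printf("user: %+.2f%%\n", user);
  std::printf("real: %+.2f%%\n", real);
}
```

With these inputs the user time rises by roughly 0.23%, which matches the quoted 0.2%. The wall-clock time rises by roughly 1.4%, although a single parallel build is probably too noisy to read much into that figure.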

Performance is still running.

In D68593#1725683, @wmi wrote:

> Looks like a very helpful patch. I saw the problem showing up at multiple places when I was looking at a halide testcase and I was wondering what was wrong. I can try it out and see if the patch can fix it.

I tried it and found it was a different issue in the test I looked at. Anyway, it still looks helpful.

This is a problem I have noticed. The patch should probably add a test case showing the benefit.

Add a new test case.

Also tested with spec2006int, result changes from 39.0 to 39.2.

Herald added a subscriber: zzheng. Oct 30 2019, 3:13 PM

Nice perf improvement :)

This patch should be accepted.

lgtm

This revision is now accepted and ready to land. Oct 31 2019, 6:09 AM

In the description, you can indent the whole code block by 2, then Phabricator will nicely present it.

Closed by commit rGcecc0d27ad58: [NewPM] Add an SROA pass after loop unroll (authored by Carrot). Nov 1 2019, 3:02 PM

This revision was automatically updated to reflect the committed changes.

Our bots started failing after this change landed with the following error:

******************** TEST 'Clang :: CodeGenCXX/union-tbaa2.cpp' FAILED ********************
Script:
--
: 'RUN: at line 1';   /b/s/w/ir/k/recipe_cleanup/clangSkDsbg/llvm_build_dir/bin/clang -cc1 -internal-isystem /b/s/w/ir/k/recipe_cleanup/clangSkDsbg/llvm_build_dir/lib/clang/10.0.0/include -nostdsysteminc /b/s/w/ir/k/llvm-project/clang/test/CodeGenCXX/union-tbaa2.cpp -O2 -std=c++11 -triple x86_64-unknown-linux-gnu -target-cpu x86-64 -target-feature +sse4.2 -target-feature +avx -emit-llvm -o - | /b/s/w/ir/k/recipe_cleanup/clangSkDsbg/llvm_build_dir/bin/FileCheck /b/s/w/ir/k/llvm-project/clang/test/CodeGenCXX/union-tbaa2.cpp
--
Exit Code: 1

Command Output (stderr):
--
/b/s/w/ir/k/llvm-project/clang/test/CodeGenCXX/union-tbaa2.cpp:19:11: error: CHECK: expected string not found in input
// CHECK: store <4 x double>
          ^
<stdin>:1:1: note: scanning from here
; ModuleID = '/b/s/w/ir/k/llvm-project/clang/test/CodeGenCXX/union-tbaa2.cpp'
^
<stdin>:6:33: note: possible intended match here
@.str = private unnamed_addr constant [4 x i8] c"%f \00", align 1
                                ^

--

Our toolchain uses the new PM by default, which is presumably why this hasn't manifested on other bots. Can you quickly fix this, or is it OK to revert the change?

Try to add

-fno-experimental-new-pass-manager

to that test.

I see no reason to revert this since upstream LLVM works.
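For reference, a sketch of what that suggestion could look like in the RUN line of union-tbaa2.cpp. The %clang_cc1 and %s substitutions and the position of the new flag are assumptions reconstructed from the expanded command in the bot log above; they are not copied from the committed fix.

```cpp
// Hypothetical RUN line: the original options from the bot log, plus the
// flag that forces the legacy pass manager for this one test.
// RUN: %clang_cc1 %s -O2 -fno-experimental-new-pass-manager -std=c++11 -triple x86_64-unknown-linux-gnu -target-cpu x86-64 -target-feature +sse4.2 -target-feature +avx -emit-llvm -o - | FileCheck %s
```

Pinning the test to the legacy pass manager keeps its CHECK lines valid without changing them, at the cost of no longer exercising the new PM pipeline in this test.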

MaskRay mentioned this in rGe0b3a8c99156: [CodeGenCXX][test] Use -fno-experimental-new-pass-manager for CodeGenCXX/union…. Nov 2 2019, 4:07 PM

phosek mentioned this in D69757: [CodeGenCXX] Don't use new PM with union-tbaa2 test. Nov 2 2019, 4:12 PM

xbolva00 mentioned this in D87972: [OldPM] Pass manager: run SROA after (simple) loop unrolling. Sep 20 2020, 4:38 AM

lebedev.ri mentioned this in rG03bd5198b6f7: [OldPM] Pass manager: run SROA after (simple) loop unrolling. Oct 4 2020, 1:54 AM

dongAxis1944 added a subscriber: dongAxis1944. Oct 28 2020, 4:04 AM

Revision Contents

Path                                          Size
llvm/lib/Passes/PassBuilder.cpp               3 lines
llvm/test/Other/new-pm-defaults.ll            1 line
llvm/test/Other/new-pm-thinlto-defaults.ll    1 line
llvm/test/Other/unroll-sroa.ll                61 lines

Diff 227544

llvm/lib/Passes/PassBuilder.cpp

PassBuilder::buildFunctionSimplificationPipeline(OptimizationLevel Level,
   FPM.addPass(SimplifyCFGPass());
   FPM.addPass(InstCombinePass());
   // The loop passes in LPM2 (IndVarSimplifyPass, LoopIdiomRecognizePass,
   // LoopDeletionPass and LoopFullUnrollPass) do not preserve MemorySSA.
   // All loop passes must preserve it, in order to be able to use it.
   FPM.addPass(createFunctionToLoopPassAdaptor(
       std::move(LPM2), /*UseMemorySSA=*/false, DebugLogging));
 
+  // Delete small array after loop unroll.
+  FPM.addPass(SROA());
+
   // Eliminate redundancies.
   if (Level != O1) {
     // These passes add substantial compile time so skip them at O1.
     FPM.addPass(MergedLoadStoreMotionPass());
     if (RunNewGVN)
       FPM.addPass(NewGVNPass());
     else
       FPM.addPass(GVN());

llvm/test/Other/new-pm-defaults.ll

 ; CHECK-O-NEXT: Starting Loop pass manager run.
 ; CHECK-O-NEXT: Running pass: IndVarSimplifyPass
 ; CHECK-O-NEXT: Running pass: LoopIdiomRecognizePass
 ; CHECK-EP-LOOP-LATE-NEXT: Running pass: NoOpLoopPass
 ; CHECK-O-NEXT: Running pass: LoopDeletionPass
 ; CHECK-O-NEXT: Running pass: LoopFullUnrollPass
 ; CHECK-EP-LOOP-END-NEXT: Running pass: NoOpLoopPass
 ; CHECK-O-NEXT: Finished Loop pass manager run.
+; CHECK-O-NEXT: Running pass: SROA on foo
 ; CHECK-Os-NEXT: Running pass: MergedLoadStoreMotionPass
 ; CHECK-Os-NEXT: Running pass: GVN
 ; CHECK-Os-NEXT: Running analysis: MemoryDependenceAnalysis
 ; CHECK-Os-NEXT: Running analysis: PhiValuesAnalysis
 ; CHECK-Oz-NEXT: Running pass: MergedLoadStoreMotionPass
 ; CHECK-Oz-NEXT: Running pass: GVN
 ; CHECK-Oz-NEXT: Running analysis: MemoryDependenceAnalysis
 ; CHECK-Oz-NEXT: Running analysis: PhiValuesAnalysis

llvm/test/Other/new-pm-thinlto-defaults.ll

 ; CHECK-O-NEXT: Running pass: LCSSAPass
 ; CHECK-O-NEXT: Finished llvm::Function pass manager run
 ; CHECK-O-NEXT: Starting Loop pass manager run.
 ; CHECK-O-NEXT: Running pass: IndVarSimplifyPass
 ; CHECK-O-NEXT: Running pass: LoopIdiomRecognizePass
 ; CHECK-O-NEXT: Running pass: LoopDeletionPass
 ; CHECK-O-NEXT: Running pass: LoopFullUnrollPass
 ; CHECK-O-NEXT: Finished Loop pass manager run.
+; CHECK-O-NEXT: Running pass: SROA on foo
 ; CHECK-Os-NEXT: Running pass: MergedLoadStoreMotionPass
 ; CHECK-Os-NEXT: Running pass: GVN
 ; CHECK-Os-NEXT: Running analysis: MemoryDependenceAnalysis
 ; CHECK-Os-NEXT: Running analysis: PhiValuesAnalysis
 ; CHECK-Oz-NEXT: Running pass: MergedLoadStoreMotionPass
 ; CHECK-Oz-NEXT: Running pass: GVN
 ; CHECK-Oz-NEXT: Running analysis: MemoryDependenceAnalysis
 ; CHECK-Oz-NEXT: Running analysis: PhiValuesAnalysis

llvm/test/Other/unroll-sroa.ll

This file was added.

; RUN: opt -disable-verify -passes='default<O2>' -S < %s | FileCheck %s

target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

; The local array %tmp can only be optimized away by sroa after loop unroll.

; CHECK-LABEL: define void @foo
; CHECK-NOT: alloca
; CHECK-NOT: call void @llvm.memcpy.p0i8.p0i8.i64

; Function Attrs: nounwind uwtable
define void @foo(i32* %a, i32* %b) {
entry:
  %a.addr = alloca i32*, align 8
  %b.addr = alloca i32*, align 8
  %tmp = alloca [4 x float], align 16
  %i = alloca i32, align 4
  store i32* %a, i32** %a.addr, align 8
  store i32* %b, i32** %b.addr, align 8
  store i32 0, i32* %i, align 4
  br label %for.cond

for.cond: ; preds = %for.inc, %entry
  %iter2 = load i32, i32* %i, align 4
  %cmp = icmp slt i32 %iter2, 4
  br i1 %cmp, label %for.body, label %for.cond.cleanup

for.cond.cleanup: ; preds = %for.cond
  br label %for.end

for.body: ; preds = %for.cond
  %inptr = load i32*, i32** %a.addr, align 8
  %idx2 = load i32, i32* %i, align 4
  %idxprom = sext i32 %idx2 to i64
  %arrayidx = getelementptr inbounds i32, i32* %inptr, i64 %idxprom
  %val = load i32, i32* %arrayidx, align 4
  %conv = sitofp i32 %val to float
  %idx = load i32, i32* %i, align 4
  %idxprom1 = sext i32 %idx to i64
  %arrayidx2 = getelementptr inbounds [4 x float], [4 x float]* %tmp, i64 0, i64 %idxprom1
  store float %conv, float* %arrayidx2, align 4
  br label %for.inc

for.inc: ; preds = %for.body
  %iter = load i32, i32* %i, align 4
  %inc = add nsw i32 %iter, 1
  store i32 %inc, i32* %i, align 4
  br label %for.cond

for.end: ; preds = %for.cond.cleanup
  %dstptr = load i32*, i32** %b.addr, align 8
  %dst = bitcast i32* %dstptr to i8*
  %arraydecay = getelementptr inbounds [4 x float], [4 x float]* %tmp, i64 0, i64 0
  %src = bitcast float* %arraydecay to i8*
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* align 4 %dst, i8* align 16 %src, i64 16, i1 false)
  ret void
}

; Function Attrs: argmemonly nounwind willreturn
declare void @llvm.memcpy.p0i8.p0i8.i64(i8* noalias nocapture writeonly, i8* noalias nocapture readonly, i64, i1 immarg)