This is an archive of the discontinued LLVM Phabricator instance.

Fix a performance problem in gep(gep ...) merging
ClosedPublic

Authored by wmi on Apr 8 2015, 5:06 PM.

Details

Summary

This patch fixes the performance problem described in https://llvm.org/bugs/show_bug.cgi?id=23163.

As described in the bug above, gep merging may have a negative effect on performance, especially when the src of the gep has multiple uses, so we want to make sure that every target gep after the merging costs no more than it did before. In the patch, we achieve this by requiring that the index values of the src and target geps being merged are both constants.

The patch is performance neutral for SPEC2000 on x86 Sandy Bridge. It improves Google internal benchmarks by 0.6 in geomean on x86 Sandy Bridge and by 0.4 on x86 Westmere.
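As a rough illustration of that guard, here is a minimal sketch using the GO1/SO1 operand names that appear later in this review (the helper name and signature are illustrative assumptions, not the literal patch):

#include "llvm/IR/Constant.h"
#include "llvm/IR/Value.h"
using namespace llvm;

// Sketch only: gep(gep ...) merging cannot create a new add instruction
// when both index values being combined are constants, because the add of
// two constants folds away.  GO1 is the target gep's index and SO1 the src
// gep's index; the helper name is hypothetical.
static bool gepMergeIsFreeOfNewAdds(const Value *GO1, const Value *SO1) {
  return isa<Constant>(GO1) && isa<Constant>(SO1);
}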

Diff Detail

Repository
rL LLVM

Event Timeline

wmi updated this revision to Diff 23455.Apr 8 2015, 5:06 PM
wmi retitled this revision to Fix a performance problem in gep(gep ...) merging.
wmi updated this object.
wmi edited the test plan for this revision. (Show Details)
wmi added reviewers: nicholas, qcolombet.
wmi set the repository for this revision to rL LLVM.
wmi added a subscriber: Unknown Object (MLST).
qcolombet edited edge metadata.Apr 10 2015, 1:20 PM

Hi Wei,

Could you upload the context with your patch to ease the review?

Thanks,
-Quentin

lib/Transforms/InstCombine/InstructionCombining.cpp
1474 ↗(On Diff #23455)

Without the context, it is hard to say whether or not this is true.
In particular, if the new GEP can be folded in an addressing mode, this is good for the register pressure.

wmi updated this revision to Diff 23623.Apr 10 2015, 2:28 PM
wmi edited edge metadata.

Sorry for missing the context in the diff.

Hi Wei,

What performance numbers do you have to back up your change?

Now the optimization looks too conservative to me. I may be wrong, but I would like to see evidence of that.

Thanks,
-Quentin

lib/Transforms/InstCombine/InstructionCombining.cpp
1473 ↗(On Diff #23623)

This seems too conservative. In particular, in your comment you mentioned the number of users.
What would a heuristic based on that give?

wmi added a comment.Apr 13 2015, 3:14 PM

I didn't see much performance change in SPEC2000, but I saw improvements in several Google internal benchmarks: 2.8% in licence plate detection, 2.5% in stitcher, 4.6% in vision, and 2.7% in zippy on Sandy Bridge.

lib/Transforms/InstCombine/InstructionCombining.cpp
1473 ↗(On Diff #23623)

At first I planned to add !Src.hasOneUse() in shouldMergeGEPs, which would disable all gep merging when the src has multiple uses.

// Replace: gep (gep %P, long B), long A, ...
// With:    T = long A+B; gep %P, T, ...

Then I realized that the merging introduces extra computation only when the add instruction (T = long A+B above) generated by the gep merging cannot be deleted. If GO1 and SO1 are both constants, no add is generated by the merging, i.e., in this case the merging at least does no harm.

So I proposed the patch here. Compared with the first proposal, if there are multiple uses but GO1 and SO1 are constants for some or all of those uses, it is still beneficial to do the gep merging fully or partially. However, I saw little performance difference between the two proposals in SPEC2000 and internal testing.

A more general proposal than the second one is to do gep merging only when the add expression of SO1 and GO1 can be folded into a constant. However, considering the two proposals show little perf difference, I didn't try it. I can try that if you think it is worth exploring.
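For reference, a simplified sketch of what the first proposal would have looked like (the helper name is hypothetical; the in-tree shouldMergeGEPs performs additional checks):

#include "llvm/IR/Operator.h"
using namespace llvm;

// Proposal 1, sketched (simplified): refuse to merge gep(gep ...) whenever
// the src gep has more than one use, so the add produced by the merge can
// never end up duplicated for other users.
static bool srcGEPIsSafeToMerge(GEPOperator &Src) {
  return Src.hasOneUse();
}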

Thanks for the numbers Wei!

lib/Transforms/InstCombine/InstructionCombining.cpp
1473 ↗(On Diff #23623)

What about something in between:
if ((!isa<Constant>(GO1) || !isa<Constant>(SO1)) && !Src->hasOneUse())

We still keep the benefit when both are constants, and we handle the single-user case.
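In self-contained form, that in-between heuristic reads roughly like this sketch (the helper name is hypothetical; GO1, SO1, and Src are the names used in the patch under review):

#include "llvm/IR/Constant.h"
#include "llvm/IR/Operator.h"
using namespace llvm;

// The in-between heuristic, sketched: allow the merge when both indices
// are constants (no new add is created) or when the src gep has a single
// use (the original add can be removed after merging).  This is the
// positive form of the bail-out condition proposed above.
static bool mergeLooksProfitable(const Value *GO1, const Value *SO1,
                                 const GEPOperator &Src) {
  return (isa<Constant>(GO1) && isa<Constant>(SO1)) || Src.hasOneUse();
}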

wmi added a comment.Apr 13 2015, 4:12 PM

Quentin Colombet wrote:

What about something in between:
if ((!isa<Constant>(GO1) || !isa<Constant>(SO1)) && !Src->hasOneUse())

We still keep the benefit when both are constants, and we handle the single-user case.

For the single-user case, combining may still sometimes hurt performance
when the src gep is outside a loop and the target gep is inside the loop:
gep merging can move an add instruction from outside the loop to inside it.

I have a testcase, 2.cc, which shows the problem. Disabling gep merging
for it reduces the instruction count in its kernel loop.

Before gep merging:

%10 = sext i32 %7 to i64
%11 = getelementptr inbounds i8, i8* %9, i64 %10  ====> the src gep
....

; <label>:16 ; preds = %14, %18

%x.1 = phi i32 [ %x.0, %14 ], [ %27, %18 ]
%17 = icmp slt i32 %x.1, %2
br i1 %17, label %18, label %28

; <label>:18 ; preds = %16

%19 = mul nsw i32 %x.1, 2
%20 = sext i32 %19 to i64
%21 = getelementptr inbounds i8, i8* %11, i64 %20  ====> the target gep; the only use of %11 inside the loop

...
br label %16

After gep merging:

; <label>:14 ; preds = %12, %16

%x.1 = phi i32 [ %x.0, %12 ], [ %25, %16 ]
%15 = icmp slt i32 %x.1, %2
br i1 %15, label %16, label %26

; <label>:16

%17 = shl nsw i32 %x.1, 1
%18 = sext i32 %17 to i64
%.sum = add nsw i64 %9, %18   ========> the extra add inside loop.
%19 = getelementptr inbounds i8, i8* %8, i64 %.sum
...
br label %14
qcolombet accepted this revision.Apr 13 2015, 4:43 PM
qcolombet edited edge metadata.

The thing with the "both constants" approach is that we may miss some cases where an addressing mode may apply. In particular, on x86, when one argument of the add is a constant, we would be able to match reg1 + reg2 + displacement.

That being said, your numbers imply this is not a problem in practice, and most targets do not support reg1 + reg2 + reg3/imm addressing mode. Anyway, we could recover from that later on if needed.

So LGTM.

Thanks Wei!
-Quentin

This revision is now accepted and ready to land.Apr 13 2015, 4:43 PM
nicholas edited edge metadata.Apr 14 2015, 1:16 AM

Quentin Colombet wrote:

The thing with the "both constants" approach is that we may miss some cases where an addressing mode may apply. In particular, on x86, when one argument of the add is a constant, we would be able to match reg1 + reg2 + displacement.

That being said, your numbers imply this is not a problem in practice, and most targets do not support reg1 + reg2 + reg3/imm addressing mode. Anyway, we could recover from that later on if needed.

So LGTM.

I might be misunderstanding why this patch helps, so please bear with me.
Is this not a special case of a more general problem, namely that we're
merging something into a loop that was previously two operations, one
inside and one outside the loop? Why wouldn't this problem come up in
other ways? We already have other cases where we introduce loop-carried
dependencies where none previously existed; shouldn't we approach this by
adding an API to detect those and then making transforms conditional on
it?

Nick

meheff added a subscriber: meheff.Apr 15 2015, 11:34 AM

There was a discussion on the mailing list about the exact issue this patch is addressing. We (the Google team working on the NVPTX backend) ran into this same issue, as did someone (Francois Pichet) working with the PPC backend. Here's the link to the discussion:

http://lists.cs.uiuc.edu/pipermail/llvmdev/2015-March/083398.html

I initially proposed what your patch does: don't merge geps unless the indices are constant (in which case the add is folded, so no additional instructions can be created in loops). However, after some discussion we came to the conclusion that gep merging is a good canonicalization which we want to keep. Instead, a better approach might be to split GEPs (undoing the early GEP merge) very late, right before codegen, when it is determined to be profitable. This way, all the intermediate passes get the advantage of the canonicalization, while still being able to do LICM/CSE on the GEP subexpressions that the gep merge had prevented.

I have a patch in the last stages of testing which performs this optimization in the SeparateConstOffsetFromGEP pass (to be renamed ReassociateGEPs... or maybe SplitGEPs, now that I think about it). I ran my patch on your example from bug 23163 and it appears to perform better than the prevent-gep-merging approach. With the head compiler I get 29 instructions in the inner loop (this matches what you report in the bug). You mention that disabling merging reduces the loop to 25 instructions. With the patch I am working on, it gets down to 16 instructions.

The pass is currently enabled only for select backends: nvptx, arm (-O3), and ppc (-O3). For testing with your bug I added it to the x86 addIRPasses(), followed by CSE and LICM (these extra passes do nothing on your example without the GEP-splitting pass).
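Roughly, that hookup would look like the following sketch (illustrative only: the helper name, exact pass list, and ordering are assumptions based on the description above, not the actual patch):

#include "llvm/IR/LegacyPassManager.h"
#include "llvm/Transforms/Scalar.h"
using namespace llvm;

// Sketch: run the GEP-splitting pass late and then clean up the exposed
// address subexpressions, mirroring the addIRPasses() setup described
// above.  The helper name and placement are hypothetical.
static void addGEPSplittingPasses(legacy::PassManager &PM) {
  PM.add(createSeparateConstOffsetFromGEPPass()); // split/reassociate merged GEPs
  PM.add(createEarlyCSEPass());                   // CSE the common subexpressions
  PM.add(createLICMPass());                       // hoist loop-invariant address parts
}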

I should have the patch out for review later in the week.

Sorry for the duplicate effort here. I should have filed a bug on this issue before diving in :-(

Mark

wmi updated this revision to Diff 24096.Apr 20 2015, 9:58 PM
wmi edited edge metadata.
  1. Updated getelementptr.ll.
  2. descale-zero.ll failed with the patch. The testcase did not actually exercise the code added in the original patch (InstCombiner::Descale was not called for it), which showed it was an invalid testcase, so I simply removed it.
This revision was automatically updated to reflect the committed changes.
wmi added a comment.May 11 2015, 3:28 PM

I collected the stat data for SPEC2000. Overall, the limit of
MaxLookupSearchDepth is not reached very often. After gep merging is
disabled, the number of times the limit is reached increases only a
little, for two files in SPEC2000. The stat data is below.

Without the patch -- with gep merging:
reach_limit = 27, total = 81402, ratio = 0.03%, file = sv.c

With the patch -- without gep merging:
reach_limit = 28, total = 81486, ratio = 0.03%, file = sv.c
reach_limit = 8, total = 1107, ratio = 0.72%, file = tietze.c

I also tested the performance of SPEC2000 and internal benchmarks.
There were no performance differences.

I added the statistic to the -stats output. I will send it out in a separate patch.

Wei.

llvm/trunk/lib/Transforms/InstCombine/InstructionCombining.cpp