This is an archive of the discontinued LLVM Phabricator instance.

Fix a performance problem in gep(gep ...) merging
ClosedPublic

Authored by wmi on Apr 8 2015, 5:06 PM.

Details

Summary

This patch fixes the performance problem described in https://llvm.org/bugs/show_bug.cgi?id=23163.

As described in the bug above, gep merging may have a negative effect on performance, especially when the src of the gep has multiple uses, so we want to make sure that every target gep after the merging costs no more than it did before. In the patch, we achieve this by requiring that the index values of the src and target geps being merged are both constants.

The patch is performance neutral for SPEC2000 on x86 Sandy Bridge. It improves Google internal benchmarks by 0.6 in geomean on x86 Sandy Bridge and by 0.4 on x86 Westmere.
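As a rough illustration of that guard, here is a minimal sketch using the GO1/SO1 operand names that appear later in this review (the helper name and signature are illustrative assumptions, not the literal patch):

#include "llvm/IR/Constant.h"
#include "llvm/IR/Value.h"
using namespace llvm;

// Sketch only: gep(gep ...) merging cannot create a new add instruction
// when both index values being combined are constants, because the add of
// two constants folds away.  GO1 is the target gep's index and SO1 the src
// gep's index; the helper name is hypothetical.
static bool gepMergeIsFreeOfNewAdds(const Value *GO1, const Value *SO1) {
  return isa<Constant>(GO1) && isa<Constant>(SO1);
}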

Diff Detail

Repository
rL LLVM

Event Timeline

wmi updated this revision to Diff 23455.Apr 8 2015, 5:06 PM
wmi retitled this revision to Fix a performance problem in gep(gep ...) merging.
wmi updated this object.
wmi edited the test plan for this revision. (Show Details)
wmi added reviewers: nicholas, qcolombet.
wmi set the repository for this revision to rL LLVM.
wmi added a subscriber: Unknown Object (MLST).
qcolombet edited edge metadata.Apr 10 2015, 1:20 PM

Hi Wei,

Could you upload the context with your patch to ease the review?

Thanks,
-Quentin

lib/Transforms/InstCombine/InstructionCombining.cpp
1474 ↗(On Diff #23455)

Without the context, it is hard to say whether or not this is true.
In particular, if the new GEP can be folded in an addressing mode, this is good for the register pressure.

wmi updated this revision to Diff 23623.Apr 10 2015, 2:28 PM
wmi edited edge metadata.

Sorry for missing the context in the diff.

Hi Wei,

What performance numbers do you have to back up your change?

Now the optimization looks too conservative to me. I may be wrong, but I would like to see evidence of that.

Thanks,
-Quentin

lib/Transforms/InstCombine/InstructionCombining.cpp
1473 ↗(On Diff #23623)

This seems too conservative. In particular, in your comment you mentioned the number of users.
What would a heuristic based on that give?

wmi added a comment.Apr 13 2015, 3:14 PM

I didn't see much performance change in SPEC2000, but I saw improvements in several Google internal benchmarks: 2.8% in licence plate detection, 2.5% in stitcher, 4.6% in vision, and 2.7% in zippy on Sandy Bridge.

lib/Transforms/InstCombine/InstructionCombining.cpp
1473 ↗(On Diff #23623)

At first I planned to add !Src.hasOneUse() in shouldMergeGEPs, which would disable all gep merging when the src has multiple uses.

// Replace: gep (gep %P, long B), long A, ...
// With:    T = long A+B; gep %P, T, ...

Then I realized that the merging introduces extra computation only when the add instruction (T = long A+B above) generated by the gep merging cannot be deleted. If GO1 and SO1 are both constants, no add is generated by the merging, i.e., in this case the merging at least does no harm.

So I proposed the patch here. Compared with the first proposal, if there are multiple uses but GO1 and SO1 are constants for some or all of those uses, it is still beneficial to do the gep merging fully or partially. However, I saw little performance difference between the two proposals in SPEC2000 and internal testing.

A more general proposal than the second one is to do gep merging only when the add expression of SO1 and GO1 can be folded into a constant. However, considering the two proposals show little perf difference, I didn't try it. I can try that if you think it is worth exploring.
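For reference, a simplified sketch of what the first proposal would have looked like (the helper name is hypothetical; the in-tree shouldMergeGEPs performs additional checks):

#include "llvm/IR/Operator.h"
using namespace llvm;

// Proposal 1, sketched (simplified): refuse to merge gep(gep ...) whenever
// the src gep has more than one use, so the add produced by the merge can
// never end up duplicated for other users.
static bool srcGEPIsSafeToMerge(GEPOperator &Src) {
  return Src.hasOneUse();
}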

Thanks for the numbers Wei!

lib/Transforms/InstCombine/InstructionCombining.cpp
1473 ↗(On Diff #23623)

What about something in between:
if ((!isa<Constant>(GO1) || !isa<Constant>(SO1)) && !Src->hasOneUse())

We still keep the benefit when both are constants, and we handle the single-user case.
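In self-contained form, that in-between heuristic reads roughly like this sketch (the helper name is hypothetical; GO1, SO1, and Src are the names used in the patch under review):

#include "llvm/IR/Constant.h"
#include "llvm/IR/Operator.h"
using namespace llvm;

// The in-between heuristic, sketched: allow the merge when both indices
// are constants (no new add is created) or when the src gep has a single
// use (the original add can be removed after merging).  This is the
// positive form of the bail-out condition proposed above.
static bool mergeLooksProfitable(const Value *GO1, const Value *SO1,
                                 const GEPOperator &Src) {
  return (isa<Constant>(GO1) && isa<Constant>(SO1)) || Src.hasOneUse();
}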

wmi added a comment.Apr 13 2015, 4:12 PM

Quentin Colombet wrote:

What about something in between:
if ((!isa<Constant>(GO1) || !isa<Constant>(SO1)) && !Src->hasOneUse())

We still keep the benefit when both are constants, and we handle the single-user case.

For the single-user case, combining may still sometimes hurt performance
when the src gep is outside a loop and the target gep is inside the loop:
gep merging can move an add instruction from outside the loop to inside it.

I have a testcase, 2.cc, which shows the problem. Disabling gep merging
for it reduces the instruction count in its kernel loop.

Before gep merging:

%10 = sext i32 %7 to i64
%11 = getelementptr inbounds i8, i8* %9, i64 %10  ====> the src gep
....

; <label>:16 ; preds = %14, %18

%x.1 = phi i32 [ %x.0, %14 ], [ %27, %18 ]
%17 = icmp slt i32 %x.1, %2
br i1 %17, label %18, label %28

; <label>:18 ; preds = %16

%19 = mul nsw i32 %x.1, 2
%20 = sext i32 %19 to i64
%21 = getelementptr inbounds i8, i8* %11, i64 %20  ====> the target gep; the only use of %11 inside the loop

...
br label %16

After gep merging:

; <label>:14 ; preds = %12, %16

%x.1 = phi i32 [ %x.0, %12 ], [ %25, %16 ]
%15 = icmp slt i32 %x.1, %2
br i1 %15, label %16, label %26

; <label>:16

%17 = shl nsw i32 %x.1, 1
%18 = sext i32 %17 to i64
%.sum = add nsw i64 %9, %18   ========> the extra add inside loop.
%19 = getelementptr inbounds i8, i8* %8, i64 %.sum
...
br label %14
qcolombet accepted this revision.Apr 13 2015, 4:43 PM
qcolombet edited edge metadata.

The thing with the "both constants" approach is that we may miss some cases where an addressing mode may apply. In particular, on x86, when one argument of the add is a constant, we would be able to match reg1 + reg2 + displacement.

That being said, your numbers imply this is not a problem in practice, and most targets do not support reg1 + reg2 + reg3/imm addressing mode. Anyway, we could recover from that later on if needed.

So LGTM.

Thanks Wei!
-Quentin

This revision is now accepted and ready to land.Apr 13 2015, 4:43 PM
nicholas edited edge metadata.Apr 14 2015, 1:16 AM

Quentin Colombet wrote:

The thing with the "both constants" approach is that we may miss some cases where an addressing mode may apply. In particular, on x86, when one argument of the add is a constant, we would be able to match reg1 + reg2 + displacement.

That being said, your numbers imply this is not a problem in practice, and most targets do not support reg1 + reg2 + reg3/imm addressing mode. Anyway, we could recover from that later on if needed.

So LGTM.

I might be misunderstanding why this patch helps, so please bear with me.
Is this not a special case of a more general problem, namely that we're
merging something into a loop that was previously two operations, one
inside and one outside the loop? Why wouldn't this problem come up in
other ways? We already have other cases where we introduce loop-carried
dependencies where none previously existed; shouldn't we approach this by
adding an API to detect those and then making transforms conditional on
it?

Nick

meheff added a subscriber: meheff.Apr 15 2015, 11:34 AM

There was a discussion on the mailing list about the exact issue this patch is addressing. We (the Google team working on the NVPTX backend) ran into this same issue, as did someone (Francois Pichet) working with the PPC backend. Here's the link to the discussion:

http://lists.cs.uiuc.edu/pipermail/llvmdev/2015-March/083398.html

I initially proposed what your patch does: don't merge geps unless the indices are constant (in which case the add is folded, so no additional instructions can be created in loops). However, after some discussion we came to the conclusion that gep merging is a good canonicalization which we want to keep. Instead, a better approach might be to split GEPs (undoing the early GEP merge) very late, right before codegen, when it is determined to be profitable. This way, all the intermediate passes get the advantage of the canonicalization, while still being able to do LICM/CSE on the GEP subexpressions that the gep merge had prevented.

I have a patch in the last stages of testing which performs this optimization in the SeparateConstOffsetFromGEP pass (to be renamed ReassociateGEPs... or maybe SplitGEPs, now that I think about it). I ran my patch on your example from bug 23163 and it appears to perform better than the prevent-gep-merging approach. With the head compiler I get 29 instructions in the inner loop (this matches what you report in the bug). You mention that disabling merging reduces the loop to 25 instructions. With the patch I am working on, it gets down to 16 instructions.

The pass is currently enabled only for select backends: nvptx, arm (-O3), and ppc (-O3). For testing with your bug I added it to the x86 addIRPasses(), followed by CSE and LICM (these extra passes do nothing on your example without the GEP-splitting pass).
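Roughly, that hookup would look like the following sketch (illustrative only: the helper name, exact pass list, and ordering are assumptions based on the description above, not the actual patch):

#include "llvm/IR/LegacyPassManager.h"
#include "llvm/Transforms/Scalar.h"
using namespace llvm;

// Sketch: run the GEP-splitting pass late and then clean up the exposed
// address subexpressions, mirroring the addIRPasses() setup described
// above.  The helper name and placement are hypothetical.
static void addGEPSplittingPasses(legacy::PassManager &PM) {
  PM.add(createSeparateConstOffsetFromGEPPass()); // split/reassociate merged GEPs
  PM.add(createEarlyCSEPass());                   // CSE the common subexpressions
  PM.add(createLICMPass());                       // hoist loop-invariant address parts
}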

I should have the patch out for review later in the week.

Sorry for the duplicate effort here. I should have filed a bug on this issue before diving in :-(

Mark

wmi updated this revision to Diff 24096.Apr 20 2015, 9:58 PM
wmi edited edge metadata.
  1. Updated getelementptr.ll.
  2. descale-zero.ll failed with the patch. The testcase did not actually exercise the code added in the original patch (InstCombiner::Descale was not called for it), which showed it was an invalid testcase, so I simply removed it.
This revision was automatically updated to reflect the committed changes.
wmi added a comment.May 11 2015, 3:28 PM

I collected the stat data for SPEC2000. Overall, the limit of
MaxLookupSearchDepth is not reached very often. After gep merging is
disabled, the number of times the limit is reached increases only a
little, for two files in SPEC2000. The stat data is below.

Without the patch -- with gep merging:
reach_limit = 27, total = 81402, ratio = 0.03%, file = sv.c

With the patch -- without gep merging:
reach_limit = 28, total = 81486, ratio = 0.03%, file = sv.c
reach_limit = 8, total = 1107, ratio = 0.72%, file = tietze.c

I also tested the performance of SPEC2000 and internal benchmarks.
There were no performance differences.

I added the statistic to the -stats output. I will send it out in a separate patch.

Wei.

llvm/trunk/lib/Transforms/InstCombine/InstructionCombining.cpp