This is an archive of the discontinued LLVM Phabricator instance.

[InstCombine] Fold (gep (oneuse(gep Ptr, Idx0)), Idx1) -> (gep Ptr, (add Idx0, Idx1)) (PR51069)
ClosedPublic

Authored by RKSimon on Jul 21 2021, 8:18 AM.

Details

Summary

As noticed on D106352, after we've folded "(select C, (gep Ptr, Idx), Ptr) -> (gep Ptr, (select C, Idx, 0))", if the inner Ptr was itself a (now one-use) gep we can merge the two geps, using the sum of the indices instead.

I've limited this to basic 2-op geps - a more general case further down InstCombinerImpl::visitGetElementPtrInst doesn't have the one-use limitation, but it only creates an add if it can be created via SimplifyAddInst.

https://alive2.llvm.org/ce/z/f8pLfD (Thanks Roman!)
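
For illustration, a rough before/after sketch of the basic fold in IR (the value names and element type here are made up for this example, not taken from the patch's tests):

  ; Before: the inner gep has a single use
  %inner = getelementptr i32, i32* %ptr, i64 %idx0
  %out = getelementptr i32, i32* %inner, i64 %idx1

  ; After: a single gep off the original pointer, indexed by the sum
  %out.idx = add i64 %idx0, %idx1
  %out = getelementptr i32, i32* %ptr, i64 %out.idx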

Diff Detail

Event Timeline

RKSimon created this revision.Jul 21 2021, 8:18 AM
RKSimon requested review of this revision.Jul 21 2021, 8:18 AM
Herald added a project: Restricted Project. · Jul 21 2021, 8:18 AM
lebedev.ri added inline comments.Jul 21 2021, 8:33 AM
llvm/lib/Transforms/InstCombine/InstructionCombining.cpp
2134

I believe this add is nsw iff NewGEP is inbounds.
alive2 times out / runs out of memory even with 10min/64GB limits,
but that means it fails to disprove it, which is a good sign.
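
To make the nsw/inbounds relationship concrete, here is a hand-written sketch (illustrative names; this restates the claim above rather than proving it, and the alive2 runs are the real check):

  ; Merged gep is inbounds: the new index add may carry nsw
  %p.idx = add nsw i64 %idx0, %idx1
  %p = getelementptr inbounds i32, i32* %ptr, i64 %p.idx

  ; Merged gep is not inbounds: the add must stay a plain (wrapping) add
  %q.idx = add i64 %idx0, %idx1
  %q = getelementptr i32, i32* %ptr, i64 %q.idx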

RKSimon updated this revision to Diff 360492.Jul 21 2021, 9:03 AM

Set nsw if the new gep is inbounds

lebedev.ri accepted this revision.Jul 21 2021, 9:04 AM

LG, thank you.

This revision is now accepted and ready to land.Jul 21 2021, 9:04 AM
This revision was landed with ongoing or failed builds.Jul 22 2021, 3:16 AM
This revision was automatically updated to reflect the committed changes.

FWIW I think the obvious generalization here is that only the outer GEP needs to have two operands;
we just need to fold its index operand into the last operand of the inner one-use GEP:
https://alive2.llvm.org/ce/z/UTBPnG
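
A rough IR illustration of that first variant (made-up names and element types; the alive2 link above is the authoritative version):

  ; Before: the outer gep has a single index, the inner one-use gep has several
  %inner = getelementptr [4 x i32], [4 x i32]* %base, i64 %i, i64 %j
  %out = getelementptr i32, i32* %inner, i64 %k

  ; After: %k is folded into the last index of the inner gep
  %sum = add i64 %j, %k
  %out = getelementptr [4 x i32], [4 x i32]* %base, i64 %i, i64 %sum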

Or the other way around: the inner GEP could have only two operands,
and then we just need to fold its index into the first index of the outer GEP:
https://alive2.llvm.org/ce/z/xWVW8J
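
And a similar sketch for the second variant (again with made-up names; see the alive2 link):

  ; Before: the inner one-use gep has a single index
  %inner = getelementptr [4 x i32], [4 x i32]* %base, i64 %i
  %out = getelementptr [4 x i32], [4 x i32]* %inner, i64 %k, i64 %j

  ; After: %i is folded into the first index of the outer gep
  %sum = add i64 %i, %k
  %out = getelementptr [4 x i32], [4 x i32]* %base, i64 %sum, i64 %j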

But I guess these aren't really interesting for the select-of-geps, outside of the bigger pattern at least.

So i think this should be something like

if (Src->hasOneUse() &&
    GEP.getOperand(1)->getType() ==
        Src->getOperand(Src->getNumOperands() - 1)->getType()) {
  // Fold gep(gep(Ptr, Idx0), Idx1) -> gep(Ptr, add(Idx0, Idx1))
  bool NewInBounds = isMergedGEPInBounds(*Src, *cast<GEPOperator>(&GEP));
  SmallVector<Value *, 8> Indices;
  Indices.reserve(Src->getNumIndices() + GEP.getNumIndices() - 1);
  append_range(Indices, Src->indices());
  Indices.back() = Builder.CreateAdd(
      *GEP.idx_begin(), Indices.back(), GEP.getName() + ".idx",
      /*HasNUW*/ false, /*HasNSW*/ NewInBounds);
  append_range(Indices, make_range(GEP.idx_begin() + 1, GEP.idx_end()));
  auto *NewGEP = GetElementPtrInst::Create(
      Src->getSourceElementType(), Src->getPointerOperand(), Indices);
  NewGEP->setIsInBounds(NewInBounds);
  return NewGEP;
}

But I don't think I will pursue it further than this snippet.

Cheers - I'll take a look at adding the fold

This change is causing performance degradation on one of our important benchmarks. Here is an example showing the likely cause:

@block = global i32 zeroinitializer, align 4

define void @foo(i32* %j, i32 %k) {
entry:
  br label %for.body

for.body:
  %i.01 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
  %invariant0 = getelementptr i32, i32* @block, i32 %k
  %x = getelementptr i32, i32* %invariant0, i32 %i.01
  store i32 2, i32* %x, align 4
  %invariant1 = getelementptr i32, i32* @block, i32 %k
  %0 = load i32, i32* %j
  %y = getelementptr i32, i32* %invariant1, i32 %0
  store i32 1, i32* %y, align 4
  %inc = add nsw i32 %i.01, 1
  %cmp = icmp slt i32 %inc, 100
  br i1 %cmp, label %for.body, label %for.end

for.end:
  ret void
}

Without this change, %invariant0 and %invariant1 are CSE'd, and %invariant0 is hoisted out of the loop by LICM.

bash-5.0$ opt t.ll -S -early-cse -licm
@block = global i32 0, align 4

define void @foo(i32* %j, i32 %k) {
entry:
  %invariant0 = getelementptr i32, i32* @block, i32 %k
  br label %for.body

for.body:                                         ; preds = %for.body, %entry
  %i.01 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
  %x = getelementptr i32, i32* %invariant0, i32 %i.01
  store i32 2, i32* %x, align 4
  %0 = load i32, i32* %j, align 4
  %y = getelementptr i32, i32* %invariant0, i32 %0
  store i32 1, i32* %y, align 4
  %inc = add nsw i32 %i.01, 1
  %cmp = icmp slt i32 %inc, 100
  br i1 %cmp, label %for.body, label %for.end

for.end:                                          ; preds = %for.body
  ret void
}

After this change, %invariant0 and %invariant1 are combined away by InstCombine and turned into add instructions. There is no longer a loop-invariant instruction for LICM to hoist, nor a common subexpression to eliminate.

bash-5.0$ opt t.ll -S -instcombine -early-cse -licm
@block = global i32 0, align 4

define void @foo(i32* %j, i32 %k) {
entry:
  %0 = sext i32 %k to i64
  br label %for.body

for.body:                                         ; preds = %for.body, %entry
  %i.01 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
  %1 = zext i32 %i.01 to i64
  %x.idx = add nsw i64 %1, %0
  %x = getelementptr i32, i32* @block, i64 %x.idx
  store i32 2, i32* %x, align 4
  %2 = load i32, i32* %j, align 4
  %3 = sext i32 %2 to i64
  %y.idx = add nsw i64 %3, %0
  %y = getelementptr i32, i32* @block, i64 %y.idx
  store i32 1, i32* %y, align 4
  %inc = add nuw nsw i32 %i.01, 1
  %cmp = icmp ult i32 %i.01, 99
  br i1 %cmp, label %for.body, label %for.end

for.end:                                          ; preds = %for.body
  ret void
}

The problem is more significant when the instructions are in a deeply nested loop or when the loop has many iterations.

Can we either revert this change, or guard it behind an option that is disabled by default, until there is a solution?

Generally, I do wonder if we are missing something like SeparateConstOffsetFromGEPPass from the pipeline.
But here, can you please provide an end-to-end test?
As far as I can see, we always run EarlyCSE before InstCombine.

@Whitney which target is this?

I was testing on a Power AIX machine.

@Whitney we need the end-to-end test case to have a better understanding of where your original IR is appearing and why CSE isn't solving this - running it through -O3 seems to be fine: https://c.godbolt.org/z/jWYcra73x (vs just -instcombine), which suggests a phase-ordering issue is at play.

We are working on another reduced LLVM IR test case that shows the problem with -O3. Unfortunately, the important benchmark is huge and we are not allowed to share it.

@RKSimon we've discovered that this commit is causing substantial performance regressions (~11% runtime) in some of our benchmarks on different x86 microarchitectures (Haswell, Skylake). @wmi is trying to get an isolated test.

OK - do you know offhand if it's another CSE / loop-invariance screwup?

I suspect that while we want to apply D107935, it's only half of the story - what if LICM hasn't run yet?
We need some kind of undo transform. (And no, SeparateConstOffsetFromGEP doesn't help; I tried.)

Given that we have a test case and multiple people have reported regressions, can we revert in the meantime?

Especially if there's no obvious and clear forward fix, I'd appreciate if you could unblock us by reverting for now. Thanks!

Except that reverting regresses other benchmarks that I was addressing with this patch - Bullet, etc.

Is the regression fixed by this patch more serious than the one it introduced? Is it clear how to fix the new regression?

I believe Bullet > “two internal *micro*benchmarks”

I'm going to be on PTO again soon, so I'm going to revert this change to unblock you; I'll revisit it when I get back. For x86 at least, I believe we'll be able to handle this with basic CMOV handling inside X86DAGToDAGISel::matchAddressRecursively.

But for @Whitney's AIX case I'm still not sure.

FWIW it's not a microbenchmark per se. It's basically compression/decompression algorithms around this code base: https://github.com/google/snappy. The particular data corpus is internal, but the regression was significant enough and widespread across a number of other benchmarks as well. In addition, it seemed to exist across multiple microarchitectures (we've seen it on Power above, and I'd hypothesize we'd see it on ARM). Simon's comments feel right about where the problem is likely coming from as well.

Thanks for the additional details, yes - snappy is important.

Please can somebody confirm that rG10c982e0b3e6d46d1fe288d7dbe0a393c65a640f resolves the regressions you were seeing?

Thanks! We'll get performance testing results tomorrow.

Confirming that performance regressions are gone after rG10c982e0b3e6d46d1fe288d7dbe0a393c65a640f. Thanks again!