This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
llvm/
-
lib/Transforms/Scalar/
-
Transforms/
-
Scalar/
-
SeparateConstOffsetFromGEP.cpp
-
test/
-
CodeGen/AArch64/
-
AArch64/
1/2
aarch64-loop-gep-opt.ll
-
Transforms/SeparateConstOffsetFromGEP/RISCV/
-
SeparateConstOffsetFromGEP/
-
RISCV/
-
split-gep.ll

Differential D135957

[AArch64][SeperateConstOffsetFromGEP] Prevent pass from splitting GEP if an index has more than one use
AbandonedPublic

Authored by zjaffal on Oct 14 2022, 4:11 AM.

Download Raw Diff

Details

Reviewers

fhahn
t.p.northover
spatel
dmgreen

Summary

This fixes a 5% performance regression in CPython for some inputs by fixing codegen for any_find_slice in unicodeobject.c
https://github.com/python/cpython/blob/main/Objects/unicodeobject.c#L8748

When enabling -aarch64-enable-gep-opt the transform splits GEP instruction generating unecessary instructions.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

zjaffal created this revision.Oct 14 2022, 4:11 AM

Herald added a project: Restricted Project. · View Herald TranscriptOct 14 2022, 4:11 AM

Herald added subscribers: arphaman, hiraditya, kristof.beyls. · View Herald Transcript

zjaffal requested review of this revision.Oct 14 2022, 4:11 AM

Herald added a project: Restricted Project. · View Herald TranscriptOct 14 2022, 4:11 AM

Herald added a subscriber: llvm-commits. · View Herald Transcript

Harbormaster completed remote builds in B192155: Diff 467737.Oct 14 2022, 4:52 AM

Hmm. I somewhat regret turning this option on, as I've seen other issues elsewhere too. It seems like it is still a decent improvement in all the testing I've done though.

There is a RISCV test that is failing.

Can you explain more why this is the correct fix? Why is it limited to 2 operand geps? Do you have benchmark results for changing it?

llvm/test/CodeGen/AArch64/aarch64-loop-gep-opt.ll
60	Should this be loading from nullptr?

There is a RISCV test that is failing.

I checked that out and the pass is correct for RISCV. I am thinking of moving this to be handled in AArch64 backend instead.

Can you explain more why this is the correct fix? Why is it limited to 2 operand geps?

This is a correct fix for AArch64 because when we have a GEP with one operand and we split it we just end up generating extra instructions.
Taking the following block from the testcase:

%tmp1 = load i64, i64* %base, align 8
%tmp22 = add nsw i64 %tmp1, -1
%tmp23 = getelementptr inbounds i64, i64* %arg, i64 %tmp22
%tmp24 = load i64, i64* %tmp23, align 2
br label %bb25

After the pass it will be this

%0 = bitcast i64* %arg to i8*
%1 = shl i64 %tmp1, 3
%uglygep = getelementptr i8, i8* %0, i64 %1
%uglygep2 = getelementptr i8, i8* %uglygep, i64 -8
%2 = bitcast i8* %uglygep2 to i64*
%tmp24 = load i64, i64* %2, align 2
br label %bb25

%1 = shl i64 %tmp1, 3 is unnecessary here and the backend fails to fold it into the load so we get the following assembly

ldr	x9, [x1]
mov	w8, #1
add	x10, x0, x9, lsl #3
sub	x9, x9, #1
ldur	x10, [x10, #-8]

While disabling the pass will generate better assembly

ldr	x8, [x1]
mov	w10, #1
sub	x8, x8, #1
ldr	x9, [x0, x8, lsl  #3]

Do you have benchmark results for changing it?

The patch increased the performance by ~5% on Python workloads in a proprietary set of benchmarks. I will test it on SPEC2017 and check the impact of the fix

llvm/test/CodeGen/AArch64/aarch64-loop-gep-opt.ll
60	It is necessary, I think it is better to move `%tmp1` to be a function argument instead

In D135957#3864526, @zjaffal wrote:

There is a RISCV test that is failing.

I checked that out and the pass is correct for RISCV. I am thinking of moving this to be handled in AArch64 backend instead.

Correction, I think splitting is also worse for RISCV:
This is the failing test case for RISCV

define i64 @test1(i64* %array, i64 %i, i64 %j)  {

entry:
  %add = add nsw i64 %i, 5
  %gep = getelementptr inbounds i64, i64* %array, i64 %add
  store i64 %j, i64* %gep
  %add2 = add nsw i64 %i, 6
  %gep2 = getelementptr inbounds i64, i64* %array, i64 %add2
  store i64 %j, i64* %gep2
  %add3 = add nsw i64 %i, 35
  %gep3 = getelementptr inbounds i64, i64* %array, i64 %add3
  store i64 %add, i64* %gep3
  ret i64 undef
}

Splitting generates the following IR

addi    a3, a1, 5
slli    a4, a3, 3
add     a4, a0, a4
srli    a5, a2, 32
sw      a5, 4(a4)
sw      a2, 0(a4)
slli    a1, a1, 3
add     a0, a1, a0
sw      a5, 52(a0)
sw      a2, 48(a0)
sw      a3, 280(a0)
srli    a1, a3, 32
sw      a1, 284(a0)
ret

While disabling the pass generates

addi    a3, a1, 5
slli    a4, a3, 3
add     a4, a0, a4
sd      a2, 0(a4)
slli    a1, a1, 3
add     a0, a1, a0
sd      a2, 48(a0)
sd      a3, 280(a0)
ret

Which contains fewer instructions.

Fix test case and modify riscv test

Herald added subscribers: • pcwang-thead, frasercrmck, luismarques and 20 others. · View Herald TranscriptOct 20 2022, 10:56 AM

Harbormaster completed remote builds in B193282: Diff 469296.Oct 20 2022, 12:22 PM

Hmm. Should we be going further though? i.e why is it limited to the first index of a 2 element gep? Should ConstantOffsetExtractor::Extract be considering the number of uses of the Add/Sub/Or is examines, and checking whether we are going to require duplicating instructions to separate the constant.

Perhaps isLegalAddressingMode should be being queried more too, I'm not sure. It might require some careful benchmarking for me to be sure it isn't making some cases worse. Some cases might have multiple uses that can all be converted successfully together.

In D135957#3878970, @dmgreen wrote:

Hmm. Should we be going further though? i.e why is it limited to the first index of a 2 element gep?

I am not sure if that would be beneficial. I can't think of an example multi index gep where all the geps are non constants. I can move the check inside the for loop and see if that changes any test cases this would be a good check at first.

Should ConstantOffsetExtractor::Extract be considering the number of uses of the Add/Sub/Or is examines, and checking whether we are going to require duplicating instructions to separate the constant.

Do we need to eliminate any Add/Sub/OR that is used in instructions that aren't GEPs or that would be very strict?

In D135957#3878970, @dmgreen wrote:

Hmm. Should we be going further though? i.e why is it limited to the first index of a 2 element gep? Should ConstantOffsetExtractor::Extract be considering the number of uses of the Add/Sub/Or is examines, and checking whether we are going to require duplicating instructions to separate the constant.

I added the following check in ConstantOffsetExtractor::CanTraceInto

if(!SignExtended && !ZeroExtended){
  if (!llvm::all_of(BO->users(),
                     [](const User *U) { return isa<GetElementPtrInst>(U); }))
    return false;
}

I think it might be more general this way. I think we might need similar checks for sext and `zext operands.

In D135957#3885245, @zjaffal wrote:
In D135957#3878970, @dmgreen wrote:

Hmm. Should we be going further though? i.e why is it limited to the first index of a 2 element gep? Should ConstantOffsetExtractor::Extract be considering the number of uses of the Add/Sub/Or is examines, and checking whether we are going to require duplicating instructions to separate the constant.

I added the following check in ConstantOffsetExtractor::CanTraceInto
if(!SignExtended && !ZeroExtended){
  if (!llvm::all_of(BO->users(),
                     [](const User *U) { return isa<GetElementPtrInst>(U); }))
    return false;
}
I think it might be more general this way. I think we might need similar checks for sext and `zext operands.

Actually this causes more tests to fail. I think for now doing it for only GEPs with one operand is fine. What do you think @dmgreen

Actually this causes more tests to fail. I think for now doing it for only GEPs with one operand is fine. What do you think @dmgreen

Hello. Hmm. It's hard to tell without doing a lot of experimenting. Do you have examples of the failing tests? Are they getting worse, or just different?

I feel like you have identified a real problem here, but the solution doesn't look very general. It seems, perhaps without knowing enough about the details yet, that ConstantOffsetExtractor::Extract should be considering the uses more than it does.

Let me run some experiments. We've had another report of SeparateConstOffsetFromGEP (or - specifically the CSE) causing issues. It may be better to revert it for the time being.

dmgreen mentioned this in rG201b7858f695: [AArch64] Disable aarch64-enable-gep-opt.Nov 19 2022, 1:25 PM

I've reverted the option in 201b7858f6957d84bc75bd228224d7b28d7df61e for the time being.

Is this still needed?

Herald added subscribers: luke, StephenFan. · View Herald TranscriptFeb 1 2023, 3:01 AM

zjaffal abandoned this revision.Feb 21 2023, 4:07 AM

Revision Contents

Path

Size

llvm/

lib/

Transforms/

Scalar/

SeparateConstOffsetFromGEP.cpp

9 lines

test/

CodeGen/

AArch64/

aarch64-loop-gep-opt.ll

26 lines

Transforms/

SeparateConstOffsetFromGEP/

RISCV/

split-gep.ll

14 lines

Diff 469296

llvm/lib/Transforms/Scalar/SeparateConstOffsetFromGEP.cpp

Show First 20 Lines • Show All 997 Lines • ▼ Show 20 Lines	bool SeparateConstOffsetFromGEP::splitGEP(GetElementPtrInst *GEP) {

// Remove the constant offset in each sequential index. The resultant GEP		// Remove the constant offset in each sequential index. The resultant GEP
// computes the variadic base.		// computes the variadic base.
// Notice that we don't remove struct field indices here. If LowerGEP is		// Notice that we don't remove struct field indices here. If LowerGEP is
// disabled, a structure index is not accumulated and we still use the old		// disabled, a structure index is not accumulated and we still use the old
// one. If LowerGEP is enabled, a structure index is accumulated in the		// one. If LowerGEP is enabled, a structure index is accumulated in the
// constant offset. LowerToSingleIndexGEPs or lowerToArithmetics will later		// constant offset. LowerToSingleIndexGEPs or lowerToArithmetics will later
// handle the constant offset and won't need a new structure index.		// handle the constant offset and won't need a new structure index.

		// If GEP only has one index and it is used by non-gep instructions,
		// it is not benificial to do the split.
		if (GEP->getNumOperands() == 2) {
		Value *Idx = GEP->getOperand(1);
		if (!llvm::all_of(Idx->users(),
		[](const User *U) { return isa<GetElementPtrInst>(U); }))
		return false;
		}
gep_type_iterator GTI = gep_type_begin(*GEP);		gep_type_iterator GTI = gep_type_begin(*GEP);
for (unsigned I = 1, E = GEP->getNumOperands(); I != E; ++I, ++GTI) {		for (unsigned I = 1, E = GEP->getNumOperands(); I != E; ++I, ++GTI) {
if (GTI.isSequential()) {		if (GTI.isSequential()) {
// Splits this GEP index into a variadic part and a constant offset, and		// Splits this GEP index into a variadic part and a constant offset, and
// uses the variadic part as the new index.		// uses the variadic part as the new index.
Value *OldIdx = GEP->getOperand(I);		Value *OldIdx = GEP->getOperand(I);
User *UserChainTail;		User *UserChainTail;
Value *NewIdx =		Value *NewIdx =
▲ Show 20 Lines • Show All 374 Lines • Show Last 20 Lines

llvm/test/CodeGen/AArch64/aarch64-loop-gep-opt.ll

Show First 20 Lines • Show All 42 Lines • ▼ Show 20 Lines	do.body.i.backedge:
%.be6 = phi i32 [ %nb.1.i, %do.body.i ], [ 0, %fooo.exit ]		%.be6 = phi i32 [ %nb.1.i, %do.body.i ], [ 0, %fooo.exit ]
br label %do.body.i		br label %do.body.i

fooo.exit: ; preds = %do.body.i		fooo.exit: ; preds = %do.body.i
store i32 %nb.1.i, i32* %k0, align 4		store i32 %nb.1.i, i32* %k0, align 4
br label %do.body.i.backedge		br label %do.body.i.backedge
}		}


		define i64 @test_loop2(i64* %arg, i64* %save, i64 %tmp1) {
		bb:
		; CHECK-LABEL: bb:
		; CHECK %tmp1 = load i64, i64* null, align 8
		; CHECK %tmp22 = add nsw i64 %tmp1, -1
		; CHECK-NOT %uglygep = getelementptr i8, i8* %0, i64 %1
		; CHECK-NOT %uglygep2 = getelementptr i8, i8* %uglygep, i64 -8

		%tmp22 = add nsw i64 %tmp1, -1
		dmgreenUnsubmitted Not Done Reply Inline Actions Should this be loading from nullptr? dmgreen: Should this be loading from nullptr?
		zjaffalAuthorUnsubmitted Done Reply Inline Actions It is necessary, I think it is better to move `%tmp1` to be a function argument instead zjaffal: It is necessary, I think it is better to move `%tmp1` to be a function argument instead
		%tmp23 = getelementptr inbounds i64, i64* %arg, i64 %tmp22
		%tmp24 = load i64, i64* %tmp23, align 2
		br label %bb25

		bb25: ; preds = %bb25, %bb18
		%tmp26 = phi i64 [ 1, %bb ], [ 0, %bb25 ]
		%tmp29 = icmp eq i64 0, %tmp24
		%tmp30 = select i1 %tmp29, i64 0, i64 %tmp26
		store i64 %tmp30, i64* %save
		%tmp31 = icmp eq i64 0, %tmp22
		br i1 %tmp31, label %bb32, label %bb25

		bb32: ; preds = %bb25
		ret i64 0

		}
		No newline at end of file

llvm/test/Transforms/SeparateConstOffsetFromGEP/RISCV/split-gep.ll

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; RUN: opt < %s -mtriple=riscv64-unknown-elf -passes='separate-const-offset-from-gep,early-cse' \			; RUN: opt < %s -mtriple=riscv64-unknown-elf -passes='separate-const-offset-from-gep,early-cse' \
	; RUN: -S \| FileCheck %s			; RUN: -S \| FileCheck %s

	; Several tests for -separate-const-offset-from-gep. The transformation			; Several tests for -separate-const-offset-from-gep. The transformation
	; heavily relies on TargetTransformInfo, so we put these tests under			; heavily relies on TargetTransformInfo, so we put these tests under
	; target-specific folders.			; target-specific folders.

	; Simple case when GEPs should be optimized.			; Simple case when GEPs should be optimized.
	define i64 @test1(i64* %array, i64 %i, i64 %j) {			define i64 @test1(i64* %array, i64 %i, i64 %j) {
	; CHECK-LABEL: @test1(			; CHECK-LABEL: @test1(
	; CHECK-NEXT: entry:			; CHECK-NEXT: entry:
	; CHECK-NEXT: [[ADD:%.]] = add nsw i64 [[I:%.]], 5			; CHECK-NEXT: [[ADD:%.]] = add nsw i64 [[I:%.]], 5
	; CHECK-NEXT: [[TMP0:%.]] = getelementptr i64, i64 [[ARRAY:%.*]], i64 [[I]]			; CHECK-NEXT: [[GEP:%.]] = getelementptr inbounds i64, i64 [[ARRAY:%.*]], i64 [[ADD]]
	; CHECK-NEXT: [[GEP4:%.]] = getelementptr inbounds i64, i64 [[TMP0]], i64 5			; CHECK-NEXT: store i64 [[J:%.]], i64 [[GEP]], align 4
	; CHECK-NEXT: store i64 [[J:%.]], i64 [[GEP4]], align 4			; CHECK-NEXT: [[TMP0:%.]] = getelementptr i64, i64 [[ARRAY]], i64 [[I]]
	; CHECK-NEXT: [[GEP26:%.]] = getelementptr inbounds i64, i64 [[TMP0]], i64 6			; CHECK-NEXT: [[GEP22:%.]] = getelementptr inbounds i64, i64 [[TMP0]], i64 6
	; CHECK-NEXT: store i64 [[J]], i64* [[GEP26]], align 4			; CHECK-NEXT: store i64 [[J]], i64* [[GEP22]], align 4
	; CHECK-NEXT: [[GEP38:%.]] = getelementptr inbounds i64, i64 [[TMP0]], i64 35			; CHECK-NEXT: [[GEP34:%.]] = getelementptr inbounds i64, i64 [[TMP0]], i64 35
	; CHECK-NEXT: store i64 [[ADD]], i64* [[GEP38]], align 4			; CHECK-NEXT: store i64 [[ADD]], i64* [[GEP34]], align 4
	; CHECK-NEXT: ret i64 undef			; CHECK-NEXT: ret i64 undef
	;			;
	entry:			entry:
	%add = add nsw i64 %i, 5			%add = add nsw i64 %i, 5
	%gep = getelementptr inbounds i64, i64* %array, i64 %add			%gep = getelementptr inbounds i64, i64* %array, i64 %add
	store i64 %j, i64* %gep			store i64 %j, i64* %gep
	%add2 = add nsw i64 %i, 6			%add2 = add nsw i64 %i, 6
	%gep2 = getelementptr inbounds i64, i64* %array, i64 %add2			%gep2 = getelementptr inbounds i64, i64* %array, i64 %add2
	▲ Show 20 Lines • Show All 263 Lines • Show Last 20 Lines