This is an archive of the discontinued LLVM Phabricator instance.

[AArch64][SeparateConstOffsetFromGEP] Prevent pass from splitting GEP if an index has more than one use
Abandoned · Public

Authored by zjaffal on Oct 14 2022, 4:11 AM.

Details

Summary

This fixes a 5% performance regression in CPython for some inputs by improving codegen for any_find_slice in unicodeobject.c:
https://github.com/python/cpython/blob/main/Objects/unicodeobject.c#L8748

When -aarch64-enable-gep-opt is enabled, the transform splits GEP instructions, generating unnecessary instructions.
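
A rough sketch of the kind of guard the title describes (the helper name and placement are assumptions, not necessarily the actual patch):

#include "llvm/IR/Instructions.h"

using namespace llvm;

// Hypothetical sketch only. The idea from the title: skip splitting when a
// variable GEP index has other users, because the original add/sub must stay
// live for them and extracting the constant would only add instructions.
static bool anyIndexHasMultipleUses(GetElementPtrInst *GEP) {
  for (Value *Idx : GEP->indices())
    if (auto *I = dyn_cast<Instruction>(Idx))
      if (!I->hasOneUse())
        return true;
  return false;
}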

Diff Detail

Event Timeline

zjaffal created this revision.Oct 14 2022, 4:11 AM
zjaffal requested review of this revision.Oct 14 2022, 4:11 AM
Herald added a project: Restricted Project. Oct 14 2022, 4:11 AM

Hmm. I somewhat regret turning this option on, as I've seen other issues elsewhere too. It seems like it is still a decent improvement in all the testing I've done though.

There is a RISCV test that is failing.

Can you explain more why this is the correct fix? Why is it limited to 2 operand geps? Do you have benchmark results for changing it?

llvm/test/CodeGen/AArch64/aarch64-loop-gep-opt.ll
60

Should this be loading from nullptr?

There is a RISCV test that is failing.

I checked that out and the pass is correct for RISCV. I am thinking of moving this to be handled in the AArch64 backend instead.

Can you explain more why this is the correct fix? Why is it limited to 2 operand geps?

This is a correct fix for AArch64 because when we have a GEP with a single index operand and split it, we just end up generating extra instructions.
Taking the following block from the testcase:

%tmp1 = load i64, i64* %base, align 8
%tmp22 = add nsw i64 %tmp1, -1
%tmp23 = getelementptr inbounds i64, i64* %arg, i64 %tmp22
%tmp24 = load i64, i64* %tmp23, align 2
br label %bb25

After the pass, it becomes:

%0 = bitcast i64* %arg to i8*
%1 = shl i64 %tmp1, 3
%uglygep = getelementptr i8, i8* %0, i64 %1
%uglygep2 = getelementptr i8, i8* %uglygep, i64 -8
%2 = bitcast i8* %uglygep2 to i64*
%tmp24 = load i64, i64* %2, align 2
br label %bb25

The %1 = shl i64 %tmp1, 3 is unnecessary here, and the backend fails to fold it into the load, so we get the following assembly:

ldr	x9, [x1]
mov	w8, #1
add	x10, x0, x9, lsl #3
sub	x9, x9, #1
ldur	x10, [x10, #-8]

Disabling the pass generates better assembly, where the shift is folded into the load's addressing mode:

ldr	x8, [x1]
mov	w10, #1
sub	x8, x8, #1
ldr	x9, [x0, x8, lsl  #3]

Do you have benchmark results for changing it?

The patch increased performance by ~5% on Python workloads in a proprietary set of benchmarks. I will test it on SPEC2017 and check the impact of the fix.

llvm/test/CodeGen/AArch64/aarch64-loop-gep-opt.ll
60

It is necessary; I think it is better to make %tmp1 a function argument instead.
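
For illustration, the adjusted test could look roughly like this (a hypothetical sketch, not the committed test):

; Hypothetical sketch: %tmp1 arrives as an argument, so the block no longer
; needs a load from a possibly-null base pointer.
define i64 @test(i64* %arg, i64 %tmp1) {
entry:
  %tmp22 = add nsw i64 %tmp1, -1
  %tmp23 = getelementptr inbounds i64, i64* %arg, i64 %tmp22
  %tmp24 = load i64, i64* %tmp23, align 2
  ret i64 %tmp24
}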

There is a RISCV test that is failing.

I checked that out and the pass is correct for RISCV. I am thinking of moving this to be handled in the AArch64 backend instead.

Correction: I think splitting is also worse for RISCV. This is the failing test case:

define i64 @test1(i64* %array, i64 %i, i64 %j) {
entry:
  %add = add nsw i64 %i, 5
  %gep = getelementptr inbounds i64, i64* %array, i64 %add
  store i64 %j, i64* %gep
  %add2 = add nsw i64 %i, 6
  %gep2 = getelementptr inbounds i64, i64* %array, i64 %add2
  store i64 %j, i64* %gep2
  %add3 = add nsw i64 %i, 35
  %gep3 = getelementptr inbounds i64, i64* %array, i64 %add3
  store i64 %add, i64* %gep3
  ret i64 undef
}

Splitting generates the following assembly:

addi    a3, a1, 5
slli    a4, a3, 3
add     a4, a0, a4
srli    a5, a2, 32
sw      a5, 4(a4)
sw      a2, 0(a4)
slli    a1, a1, 3
add     a0, a1, a0
sw      a5, 52(a0)
sw      a2, 48(a0)
sw      a3, 280(a0)
srli    a1, a3, 32
sw      a1, 284(a0)
ret

While disabling the pass generates

addi    a3, a1, 5
slli    a4, a3, 3
add     a4, a0, a4
sd      a2, 0(a4)
slli    a1, a1, 3
add     a0, a1, a0
sd      a2, 48(a0)
sd      a3, 280(a0)
ret

Which contains fewer instructions.

zjaffal updated this revision to Diff 469296.Oct 20 2022, 10:56 AM

Fix test case and modify riscv test

Hmm. Should we be going further though? I.e., why is it limited to the first index of a 2-element GEP? Should ConstantOffsetExtractor::Extract be considering the number of uses of the Add/Sub/Or it examines, and checking whether we are going to require duplicating instructions to separate the constant?

Perhaps isLegalAddressingMode should be queried more too; I'm not sure. It might require some careful benchmarking for me to be sure it isn't making some cases worse. Some cases might have multiple uses that can all be converted successfully together.
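
For example, roughly something like this (only a sketch; the helper, call site, and types are assumptions):

#include "llvm/Analysis/TargetTransformInfo.h"

using namespace llvm;

// Sketch: before separating the constant, ask the target whether the combined
// (base register + constant offset) address is already legal for this access.
// If it is, splitting the GEP has little to gain.
static bool offsetFoldsIntoAddressing(const TargetTransformInfo &TTI,
                                      Type *AccessTy, int64_t ConstOffset,
                                      unsigned AddrSpace) {
  return TTI.isLegalAddressingMode(AccessTy, /*BaseGV=*/nullptr,
                                   /*BaseOffset=*/ConstOffset,
                                   /*HasBaseReg=*/true, /*Scale=*/0,
                                   AddrSpace);
}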

Hmm. Should we be going further though? I.e., why is it limited to the first index of a 2-element GEP?

I am not sure if that would be beneficial. I can't think of an example of a multi-index GEP where all the indices are non-constant. I can move the check inside the for loop and see if it changes any test cases; this would be a good first check.

Should ConstantOffsetExtractor::Extract be considering the number of uses of the Add/Sub/Or it examines, and checking whether we are going to require duplicating instructions to separate the constant?

Do we need to eliminate any Add/Sub/Or that is used by instructions that aren't GEPs, or would that be very strict?

Hmm. Should we be going further though? I.e., why is it limited to the first index of a 2-element GEP? Should ConstantOffsetExtractor::Extract be considering the number of uses of the Add/Sub/Or it examines, and checking whether we are going to require duplicating instructions to separate the constant?

I added the following check in ConstantOffsetExtractor::CanTraceInto

// Only extract a constant offset when every user of the Add/Sub/Or is a GEP;
// otherwise the original instruction stays live and splitting adds work.
if (!SignExtended && !ZeroExtended) {
  if (!llvm::all_of(BO->users(),
                    [](const User *U) { return isa<GetElementPtrInst>(U); }))
    return false;
}

I think it might be more general this way. I think we might need similar checks for sext and zext operands.

Actually this causes more tests to fail. I think for now doing it only for GEPs with one operand is fine. What do you think, @dmgreen?

Actually this causes more tests to fail. I think for now doing it only for GEPs with one operand is fine. What do you think, @dmgreen?

Hello. Hmm. It's hard to tell without doing a lot of experimenting. Do you have examples of the failing tests? Are they getting worse, or just different?

I feel like you have identified a real problem here, but the solution doesn't look very general. It seems, perhaps without knowing enough about the details yet, that ConstantOffsetExtractor::Extract should be considering the uses more than it does.

Let me run some experiments. We've had another report of SeparateConstOffsetFromGEP (or, more specifically, the CSE) causing issues. It may be better to revert it for the time being.

I've reverted the option in 201b7858f6957d84bc75bd228224d7b28d7df61e for the time being.

fhahn added a comment.Feb 1 2023, 3:01 AM

Is this still needed?

zjaffal abandoned this revision.Feb 21 2023, 4:07 AM