Page MenuHomePhabricator

[InstCombine] Canonicalize GEP of GEP by swapping constant-indexed GEP to the front
AbandonedPublic

Authored by huangjd on Jul 14 2022, 12:07 AM.

Details

Summary

Alternative implementation to D125845. This way the code is cleaner as it will not interfere with LICM, because swapping constant GEP to the front will actually allow LICM to move it out of the loop. In addition, it is possibly more beneficial to codegen since evaluation of variable indices is pushed to the back, reducing potential pipeline stall on some architectures.

Diff Detail

Unit TestsFailed

TimeTest
60,050 msx64 debian > MLIR.Examples/standalone::test.toy
Script: -- : 'RUN: at line 1'; /usr/bin/cmake /var/lib/buildkite-agent/builds/llvm-project/mlir/examples/standalone -G "Ninja" -DCMAKE_CXX_COMPILER=/usr/bin/clang++ -DCMAKE_C_COMPILER=/usr/bin/clang -DLLVM_ENABLE_LIBCXX=OFF -DMLIR_DIR=/var/lib/buildkite-agent/builds/llvm-project/build/lib/cmake/mlir -DLLVM_USE_LINKER=lld
60,070 msx64 debian > ThreadSanitizer-x86_64.ThreadSanitizer-x86_64::restore_stack.cpp
Script: -- : 'RUN: at line 1'; /var/lib/buildkite-agent/builds/llvm-project/build/./bin/clang --driver-mode=g++ -fsanitize=thread -Wall -m64 -msse4.2 -gline-tables-only -I/var/lib/buildkite-agent/builds/llvm-project/compiler-rt/test/tsan/../ -std=c++11 -I/var/lib/buildkite-agent/builds/llvm-project/compiler-rt/test/tsan/../ -O1 /var/lib/buildkite-agent/builds/llvm-project/compiler-rt/test/tsan/restore_stack.cpp -o /var/lib/buildkite-agent/builds/llvm-project/build/projects/compiler-rt/test/tsan/X86_64Config/Output/restore_stack.cpp.tmp && not /var/lib/buildkite-agent/builds/llvm-project/build/projects/compiler-rt/test/tsan/X86_64Config/Output/restore_stack.cpp.tmp 2>&1 | FileCheck /var/lib/buildkite-agent/builds/llvm-project/compiler-rt/test/tsan/restore_stack.cpp

Event Timeline

huangjd created this revision.Jul 14 2022, 12:07 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 14 2022, 12:07 AM
huangjd requested review of this revision.Jul 14 2022, 12:07 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 14 2022, 12:07 AM

Note: this is a NEW patch and the implementation is different from D125845.

huangjd added inline comments.Jul 14 2022, 12:15 AM
llvm/lib/Transforms/InstCombine/InstructionCombining.cpp
1981

Apparently this statement actually causes a bug, which inhibits merging of constant-index GEP in vector-reverse-mask4.ll and interleaved-accesses.ll

llvm/test/Transforms/InstCombine/gep-canonicalize-constant-indices.ll
27

test is no longer relevant since we are swapping constant-indexed GEP to the front

52–54

Simplified unnecessarily complicated test

55

C

huangjd updated this revision to Diff 444537.Jul 14 2022, 12:15 AM
  • Remove now irrelevant test

I think this will do worse in practice, because it breaks GVN. Imagine you have accesses like ary[i].x, ary[i].y, ary[i].z represented with two GEPs. If constants are canonicalized to the end, then the ary[i] GEP is the same all three times and can be CSEd/GVNd. If the constants are canonicalized to the start, then these all become distinct GEPs on different bases.

I think this will do worse in practice, because it breaks GVN. Imagine you have accesses like ary[i].x, ary[i].y, ary[i].z represented with two GEPs. If constants are canonicalized to the end, then the ary[i] GEP is the same all three times and can be CSEd/GVNd. If the constants are canonicalized to the start, then these all become distinct GEPs on different bases.

Expressions like ary[i].x will always be merged in visitGEPofGEP into getelementptr type, ptr ary, i64 i, i32 0 even without my patch. In fact almost all valid C++ array/member access expression will be coalesced into one multi-index GEP, unless you are casting the pointer into another type that it actually isn't, and try to do more pointer arithmetic, which is probably intentional UB anyways.

If that breaks GVN then I think GEP transform pass should be pushed back so that CSE happens first. In the other hand, CSE should actually not apply to accesses in your case, because in most architecture load instruction can take a constant offset, so it wouldn't be necessary to compute the intermediate address of ary[i].

Can you provide an example showing how this is better than the alternative patch (e.g. helping LICM)?

Also provide an example showing nikic's concern is addressed?

Consider the following practical example

struct Vec {
    float  x, y, z;
};

float f1(Vec vecs[], size_t n) {
    float sum = 0;
    for (size_t i = 0; i < n; i++) {
        float g = 0;
        g += vecs[i].x * vecs[i].x;
        g += vecs[i].y * vecs[i].y;
        g += vecs[i].z * vecs[i].z;
        sum += sqrtf(g);
    }
    return sum;
}

llvm generates the following with optimizations without my patch

define dso_local noundef float @_Z2f1P3Vecm(ptr nocapture noundef readonly %0, i64 noundef %1) local_unnamed_addr #0 !dbg !361 {
  %3 = icmp eq i64 %1, 0, !dbg !385
  br i1 %3, label %4, label %6, !dbg !386

4: ; preds = %6, %2
  %5 = phi float [ 0.000000e+00, %2 ], [ %19, %6 ], !dbg !383
  ret float %5, !dbg !387

6: ; preds = %2, %6
  %7 = phi float [ %19, %6 ], [ 0.000000e+00, %2 ]
  %8 = phi i64 [ %20, %6 ], [ 0, %2 ]
  %9 = getelementptr inbounds %struct.Vec, ptr %0, i64 %8, !dbg !389
  %10 = load float, ptr %9, align 4, !dbg !390, !tbaa !391
  %11 = tail call float @llvm.fmuladd.f32(float %10, float %10, float 0.000000e+00), !dbg !396
  %12 = getelementptr inbounds %struct.Vec, ptr %0, i64 %8, i32 1, !dbg !397
  %13 = load float, ptr %12, align 4, !dbg !397, !tbaa !398
  %14 = tail call float @llvm.fmuladd.f32(float %13, float %13, float %11), !dbg !399
  %15 = getelementptr inbounds %struct.Vec, ptr %0, i64 %8, i32 2, !dbg !400
  %16 = load float, ptr %15, align 4, !dbg !400, !tbaa !401
  %17 = tail call float @llvm.fmuladd.f32(float %16, float %16, float %14), !dbg !402
  %18 = tail call float @llvm.sqrt.f32(float %17), !dbg !403
  %19 = fadd float %7, %18, !dbg !404
  %20 = add nuw i64 %8, 1, !dbg !405
  %21 = icmp eq i64 %20, %1, !dbg !385
  br i1 %21, label %4, label %6, !dbg !386, !llvm.loop !406
}

GEP of GEP are merged at a very early pass before common subexpression, and I actually couldn't write any C++ code that would make LLVM generate a GEP of GEP where the second one has constant index.

The example provided shows that there is a missing opportunity for CSE (of address computation of Vec[i]). The decision on GEP ordering should probably better be based on more general reassociation analysis for enabling more CSE/LICM.

follow up @davidxl
For the original issue where a chain of 3 or more gep with the first and last being constant indexed cannot be simplified, this patch can handle such case while D125845 can't
Consider the following code

01   b = gep a, const_index
02   use(b)
03   c = gep b, var_index
04   d = gep c, const_index
05   ret d

In this patch line 4 is swapped with line 3, and then it is constant-index folded with line 1 (line 1 is unchanged, and the old line 4 is replaced with gep a with new constant indices, breaking the dependency of b, and result in better codegen.

But if D125845 is used, then line 1 is supposed to be swapped with line 3, but since b has more than 1 use, the swap is inhibited, and nothing can be optimized.

follow up @davidxl
For the original issue where a chain of 3 or more gep with the first and last being constant indexed cannot be simplified, this patch can handle such case while D125845 can't
Consider the following code

01   b = gep a, const_index
02   use(b)
03   c = gep b, var_index
04   d = gep c, const_index
05   ret d

In this patch line 4 is swapped with line 3, and then it is constant-index folded with line 1 (line 1 is unchanged, and the old line 4 is replaced with gep a with new constant indices, breaking the dependency of b, and result in better codegen.

But if D125845 is used, then line 1 is supposed to be swapped with line 3, but since b has more than 1 use, the swap is inhibited, and nothing can be optimized.

In theory, 01 should be propagated into 03 and then 04 after which 03 is deleted.

To compare two patches, I think it is worth collecting some benchmark numbers.

huangjd abandoned this revision.Oct 20 2022, 11:04 AM