Alternative implementation to D125845. This way the code is cleaner as it will not interfere with LICM, because swapping constant GEP to the front will actually allow LICM to move it out of the loop. In addition, it is possibly more beneficial to codegen since evaluation of variable indices is pushed to the back, reducing potential pipeline stall on some architectures.
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
llvm/lib/Transforms/InstCombine/InstructionCombining.cpp | ||
---|---|---|
1981 | Apparently this statement actually causes a bug, which inhibits merging of constant-index GEP in vector-reverse-mask4.ll and interleaved-accesses.ll | |
llvm/test/Transforms/InstCombine/gep-canonicalize-constant-indices.ll | ||
27 | test is no longer relevant since we are swapping constant-indexed GEP to the front | |
52 | Simplified unnecessarily complicated test | |
55 | C |
I think this will do worse in practice, because it breaks GVN. Imagine you have accesses like ary[i].x, ary[i].y, ary[i].z represented with two GEPs. If constants are canonicalized to the end, then the ary[i] GEP is the same all three times and can be CSEd/GVNd. If the constants are canonicalized to the start, then these all become distinct GEPs on different bases.
Expressions like ary[i].x will always be merged in visitGEPofGEP into getelementptr type, ptr ary, i64 i, i32 0 even without my patch. In fact almost all valid C++ array/member access expression will be coalesced into one multi-index GEP, unless you are casting the pointer into another type that it actually isn't, and try to do more pointer arithmetic, which is probably intentional UB anyways.
If that breaks GVN then I think GEP transform pass should be pushed back so that CSE happens first. In the other hand, CSE should actually not apply to accesses in your case, because in most architecture load instruction can take a constant offset, so it wouldn't be necessary to compute the intermediate address of ary[i].
Can you provide an example showing how this is better than the alternative patch (e.g. helping LICM)?
Also provide an example showing nikic's concern is addressed?
Consider the following practical example
struct Vec { float x, y, z; }; float f1(Vec vecs[], size_t n) { float sum = 0; for (size_t i = 0; i < n; i++) { float g = 0; g += vecs[i].x * vecs[i].x; g += vecs[i].y * vecs[i].y; g += vecs[i].z * vecs[i].z; sum += sqrtf(g); } return sum; }
llvm generates the following with optimizations without my patch
define dso_local noundef float @_Z2f1P3Vecm(ptr nocapture noundef readonly %0, i64 noundef %1) local_unnamed_addr #0 !dbg !361 { %3 = icmp eq i64 %1, 0, !dbg !385 br i1 %3, label %4, label %6, !dbg !386 4: ; preds = %6, %2 %5 = phi float [ 0.000000e+00, %2 ], [ %19, %6 ], !dbg !383 ret float %5, !dbg !387 6: ; preds = %2, %6 %7 = phi float [ %19, %6 ], [ 0.000000e+00, %2 ] %8 = phi i64 [ %20, %6 ], [ 0, %2 ] %9 = getelementptr inbounds %struct.Vec, ptr %0, i64 %8, !dbg !389 %10 = load float, ptr %9, align 4, !dbg !390, !tbaa !391 %11 = tail call float @llvm.fmuladd.f32(float %10, float %10, float 0.000000e+00), !dbg !396 %12 = getelementptr inbounds %struct.Vec, ptr %0, i64 %8, i32 1, !dbg !397 %13 = load float, ptr %12, align 4, !dbg !397, !tbaa !398 %14 = tail call float @llvm.fmuladd.f32(float %13, float %13, float %11), !dbg !399 %15 = getelementptr inbounds %struct.Vec, ptr %0, i64 %8, i32 2, !dbg !400 %16 = load float, ptr %15, align 4, !dbg !400, !tbaa !401 %17 = tail call float @llvm.fmuladd.f32(float %16, float %16, float %14), !dbg !402 %18 = tail call float @llvm.sqrt.f32(float %17), !dbg !403 %19 = fadd float %7, %18, !dbg !404 %20 = add nuw i64 %8, 1, !dbg !405 %21 = icmp eq i64 %20, %1, !dbg !385 br i1 %21, label %4, label %6, !dbg !386, !llvm.loop !406 }
GEP of GEP are merged at a very early pass before common subexpression, and I actually couldn't write any C++ code that would make LLVM generate a GEP of GEP where the second one has constant index.
The example provided shows that there is a missing opportunity for CSE (of address computation of Vec[i]). The decision on GEP ordering should probably better be based on more general reassociation analysis for enabling more CSE/LICM.
follow up @davidxl
For the original issue where a chain of 3 or more gep with the first and last being constant indexed cannot be simplified, this patch can handle such case while D125845 can't
Consider the following code
01 b = gep a, const_index 02 use(b) 03 c = gep b, var_index 04 d = gep c, const_index 05 ret d
In this patch line 4 is swapped with line 3, and then it is constant-index folded with line 1 (line 1 is unchanged, and the old line 4 is replaced with gep a with new constant indices, breaking the dependency of b, and result in better codegen.
But if D125845 is used, then line 1 is supposed to be swapped with line 3, but since b has more than 1 use, the swap is inhibited, and nothing can be optimized.
In theory, 01 should be propagated into 03 and then 04 after which 03 is deleted.
To compare two patches, I think it is worth collecting some benchmark numbers.
Apparently this statement actually causes a bug, which inhibits merging of constant-index GEP in vector-reverse-mask4.ll and interleaved-accesses.ll