This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Fold tree of offset loads combine
ClosedPublic

Authored by dmgreen on Jun 28 2023, 6:58 AM.

Details

Summary

This attempts to fold trees of add(ext(load p), shl(ext(load p+4)) into a single load of twice the size, that we extract the bottom part and top part so that the shl can start to use a shll2 instruction. The two loads in that example can also be larger trees of instructions, which are identical except for the leaves which are all loads offset from the LHS, including buildvectors of multiple loads. For example
sub(zext(buildvec(load p+4, load q+4)), zext(buildvec(load r+4, load s+4)))

Whilst it can be common for the larger loads to replace LDP instructions (which doesn't gain anything on it's own), the larger loads in buildvectors can help create more efficient code, and prevent the need for ld1 lane inserts which can be slower than normal loads.

This creates a fairly niche, fairly large combine that attempts to be fairly general where it is beneficial. It helps some SLP vectorized code to avoid the use of the more expensive ld1 lane inserting loads.

Diff Detail

Event Timeline

dmgreen created this revision.Jun 28 2023, 6:58 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 28 2023, 6:58 AM
dmgreen requested review of this revision.Jun 28 2023, 6:58 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 28 2023, 6:58 AM

Yeah, I am totally onboard with the motivation of this work. We have seen this come up as a problem a few times and it is easy to see from the test codegen changes that this will give some decent improvements. The large combine and the amount of code is a little bit unfortunate, but I guess it is what it is.

The usual testing question because this is not visible in these diffs: are the tests with volatile loads and atomics in place where this shouldn't trigger?

I have just done a first pass over this code and left nits inlined. Will do a second pass tomorrow.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
18541

Nit: hasOneUse is already checked above.

18572

Not sure if it deserve a comment, but why the 4 here?

18592

Nit: .getOperand(0).getOperand(0).getOperand(0) is a bit cryptic, perhaps a pointer to it helps.

18628

Nit: this can be checked first in the function as an early exit before doing more work in isLoadOrMultipleLoads?

18693

typo

dmgreen updated this revision to Diff 535660.Jun 29 2023, 12:45 AM
dmgreen marked 5 inline comments as done.

Thanks - addressed comments and added an extra volatile/atomic test.

I was thinking of not committing the extbinopload2.ll test with the final version of this patch. It was autogenerated forms of all the possible type sizes, mostly for showing which cases were profitable and which were not. The important parts of that should already be covered by extbinopload.ll without bulking out the tests unnecessarily.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
18572

This code is just trying to match the instructions in the above comment, which is the concat of 4 loads through the shuffles. This code is a bit of a shame as we don't reliably visit operands before the root node, and will not revisit the root after optimizing the leaf operands. See D152928, which should fix this so we can rely only on the BUILD_VECTOR code above. That looks like it will take a long time to work through all the regressions though, and in the meantime I didn't think it was best to make something very complex just for it to be removed later.

18628

The Op0.getOpcode() != Op1.getOpcode() part may be false for loads due to the issue with not reliably simplifying nodes before operands. The full tree we are folding is this, where the LHS and RHS are not equally simplified:

                t24: i64 = add nuw t22, Constant:i64<4>
              t160: f32,ch = load<(load (s32) from %ir.19, align 1, !tbaa !5)> t0, t24, undef:i64
                t20: i64 = add nuw t18, Constant:i64<4>
              t161: f32,ch = load<(load (s32) from %ir.15, align 1, !tbaa !5)> t0, t20, undef:i64
                t16: i64 = add nuw t14, Constant:i64<4>
              t162: f32,ch = load<(load (s32) from %ir.11, align 1, !tbaa !5)> t0, t16, undef:i64
                t12: i64 = add nuw t2, Constant:i64<4>
              t151: f32,ch = load<(load (s32) from %ir.7, align 1, !tbaa !5)> t0, t12, undef:i64 
            t168: v4f32 = BUILD_VECTOR t160, t161, t162, t151
          t167: v16i8 = bitcast t168
        t122: v16i16 = zero_extend t167
                t25: i64 = add nuw t23, Constant:i64<4>
              t139: f32,ch = load<(load (s32) from %ir.20, align 1, !tbaa !5)> t0, t25, undef:i64
                t21: i64 = add nuw t19, Constant:i64<4>                                          
              t140: f32,ch = load<(load (s32) from %ir.16, align 1, !tbaa !5)> t0, t21, undef:i64
                t17: i64 = add nuw t15, Constant:i64<4>                                          
              t141: f32,ch = load<(load (s32) from %ir.12, align 1, !tbaa !5)> t0, t17, undef:i64
                t13: i64 = add nuw t6, Constant:i64<4>                                           
              t130: f32,ch = load<(load (s32) from %ir.8, align 1, !tbaa !5)> t0, t13, undef:i64 
            t147: v4f32 = BUILD_VECTOR t139, t140, t141, t130                                    
          t146: v16i8 = bitcast t147
        t123: v16i16 = zero_extend t146
      t124: v16i16 = sub t122, t123
    t126: v16i32 = any_extend t124
    t72: v16i32 = BUILD_VECTOR Constant:i32<16>, Constant:i32<16>, ...                           
  t73: v16i32 = shl nsw t126, t72
            t206: f32,ch = load<(load (s32) from %ir.17, align 1, !tbaa !5)> t0, t22, undef:i64
            t207: f32,ch = load<(load (s32) from %ir.13, align 1, !tbaa !5)> t0, t18, undef:i64
            t208: f32,ch = load<(load (s32) from %ir.9, align 1, !tbaa !5)> t0, t14, undef:i64
            t197: f32,ch = load<(load (s32) from %ir.0, align 1, !tbaa !5)> t0, t2, undef:i64
          t214: v4f32 = BUILD_VECTOR t206, t207, t208, t197
        t213: v16i8 = bitcast t214
      t169: v16i16 = zero_extend t213
            t185: f32,ch = load<(load (s32) from %ir.18, align 1, !tbaa !5)> t0, t23, undef:i64  
            t186: f32,ch = load<(load (s32) from %ir.14, align 1, !tbaa !5)> t0, t19, undef:i64  
            t187: f32,ch = load<(load (s32) from %ir.10, align 1, !tbaa !5)> t0, t15, undef:i64  
            t176: f32,ch = load<(load (s32) from %ir.2, align 1, !tbaa !5)> t0, t6, undef:i64
          t193: v4f32 = BUILD_VECTOR t185, t186, t187, t176
        t192: v16i8 = bitcast t193
      t170: v16i16 = zero_extend t192
    t171: v16i16 = sub t169, t170
  t172: v16i32 = sign_extend t171
t74: v16i32 = add nsw t73, t172
SjoerdMeijer accepted this revision.Jun 29 2023, 6:53 AM

Thanks, LGTM

This revision is now accepted and ready to land.Jun 29 2023, 6:53 AM
This revision was landed with ongoing or failed builds.Jun 30 2023, 4:25 AM
This revision was automatically updated to reflect the committed changes.