This is an archive of the discontinued LLVM Phabricator instance.

[DAGCombine] Add node in the worklist in topological order in CombineTo
ClosedPublic

Authored by deadalnix on May 1 2022, 1:18 PM.

Details

Summary

This is part of an ongoing effort toward making DAGCombine process the nodes in topological order.

This is able to discover a couple of new optimizations, but also causes a couple of regressions. I nevertheless chose to submit this patch for review, so as to start the discussion with people working on the backend and find a good way forward.
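To illustrate what "topological order" means for a DAG worklist: every node is visited only after all of its operands, so a combine always sees already-simplified operands. The sketch below is an illustrative Kahn's-algorithm ordering in Python, not the actual C++ DAGCombiner worklist; the function name and data layout are invented for the example.

```python
from collections import deque

def topological_order(nodes, operands):
    """Return `nodes` ordered so each node appears after all of its operands.

    `operands` maps a node to the list of nodes it uses. Illustrative
    sketch only -- the real DAGCombiner worklist is implemented in C++.
    """
    indegree = {n: len(operands.get(n, [])) for n in nodes}
    users = {n: [] for n in nodes}
    for n, ops in operands.items():
        for op in ops:
            users[op].append(n)
    # Start with nodes that have no operands (e.g. EntryToken, constants).
    ready = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while ready:
        n = ready.popleft()
        order.append(n)
        for u in users[n]:
            indegree[u] -= 1
            if indegree[u] == 0:
                ready.append(u)
    return order
```

Processing in this order is what lets CombineTo expose folds that a use-before-operand visit would miss.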

Diff Detail

Event Timeline

deadalnix created this revision.May 1 2022, 1:18 PM
deadalnix requested review of this revision.May 1 2022, 1:18 PM
Herald added a project: Restricted Project. · View Herald TranscriptMay 1 2022, 1:18 PM
deadalnix added inline comments.May 1 2022, 1:24 PM
llvm/test/CodeGen/X86/movmsk-cmp.ll
4566

I investigated this regression.

Before this patch, the DAG after type legalization and optimization looks like:

SelectionDAG has 25 nodes:
  t0: ch = EntryToken
            t62: i8 = xor t52, Constant:i8<-1>
          t55: i8 = and t62, Constant:i8<1>
          t53: i8 = and t52, Constant:i8<2>
        t56: i8 = or t55, t53
      t57: i8 = setcc t56, Constant:i8<2>, seteq:ch
    t40: i32 = select t57, Constant:i32<42>, Constant:i32<99>
  t19: ch,glue = CopyToReg t0, Register:i32 $eax, t40
        t2: v2f64,ch = CopyFromReg t0, Register:v2f64 %0
        t4: v2f64,ch = CopyFromReg t0, Register:v2f64 %1
      t41: v2i64 = setcc t2, t4, setogt:ch
    t24: i32 = X86ISD::MOVMSK t41
  t52: i8 = truncate t24
  t20: ch = X86ISD::RET_FLAG t19, TargetConstant:i32<0>, Register:i32 $eax, t19:1

After, it looks like:

SelectionDAG has 20 nodes:
  t0: ch = EntryToken
      t55: i8 = and t46, Constant:i8<1>
        t56: i8 = srl t46, Constant:i8<1>
      t21: i32 = select t56, Constant:i32<42>, Constant:i32<99>
    t22: i32 = select t55, t21, Constant:i32<99>
  t19: ch,glue = CopyToReg t0, Register:i32 $eax, t22
        t2: v2f64,ch = CopyFromReg t0, Register:v2f64 %0
        t4: v2f64,ch = CopyFromReg t0, Register:v2f64 %1
      t40: v2i64 = setcc t2, t4, setogt:ch
    t24: i32 = X86ISD::MOVMSK t40
  t46: i8 = truncate t24
  t20: ch = X86ISD::RET_FLAG t19, TargetConstant:i32<0>, Register:i32 $eax, t19:1

Arguably, the DAG after this patch looks simpler/better, so it is hard to argue this does the wrong thing here. However, the rest of the backend doesn't seem to be able to do as good a job on this DAG. Is there something that can be done here? It looks like a legitimate blind spot in the X86 backend rather than a problem with this patch. @RKSimon maybe?

deadalnix added inline comments.
llvm/test/CodeGen/AArch64/arm64-abi-varargs.ll
23

This seems to throw a wrench in D70450. @fhahn, @hans, @paquette, what do you think?

RKSimon added inline comments.May 2 2022, 1:42 AM
llvm/test/CodeGen/X86/movmsk-cmp.ll
4566

Adding a select(c2, select(c1, x, y), y) fold is possible.

Another option is to just alter the test - InstCombine will now optimize this code to a reduction pattern: https://simd.godbolt.org/z/hfv3YW4Y6

define i32 @PR39665_c_ray(<2 x double> %x, <2 x double> %y) {
  %cmp = fcmp ogt <2 x double> %x, %y
  %shift = shufflevector <2 x i1> %cmp, <2 x i1> poison, <2 x i32> <i32 1, i32 undef>
  %1 = and <2 x i1> %cmp, %shift
  %u = extractelement <2 x i1> %1, i64 0
  %r = select i1 %u, i32 42, i32 99
  ret i32 %r
}
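The suggested select(c2, select(c1, x, y), y) fold rests on the equivalence with select(and(c1, c2), x, y). A quick illustrative check of that equivalence in plain Python (the function names are invented for the example, and these are Python conditionals, not LLVM select nodes):

```python
def nested_select(c1, c2, x, y):
    # select(c2, select(c1, x, y), y)
    return (x if c1 else y) if c2 else y

def folded_select(c1, c2, x, y):
    # select(and(c1, c2), x, y)
    return x if (c1 and c2) else y

# Exhaustive check over all boolean combinations confirms the fold.
assert all(
    nested_select(c1, c2, 42, 99) == folded_select(c1, c2, 42, 99)
    for c1 in (False, True)
    for c2 in (False, True)
)
```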
deadalnix added inline comments.May 2 2022, 3:10 AM
llvm/test/CodeGen/X86/movmsk-cmp.ll
4566

There is already a fold for select(c2, select(c1, x, y), y), but the X86 backend tells DAGCombine that it prefers the double select. I tried to force the transform and got a mixed bag out of it. Sometimes it improves things, sometimes it makes them worse. Overall, the result was unclear.

I'll put up a patch with the updated IR then.

RKSimon added inline comments.May 2 2022, 3:30 AM
llvm/test/CodeGen/X86/movmsk-cmp.ll
4566

Great - can you keep the current test as well (with a TODO comment)? We don't care much about the regression but it'd be useful to keep track of any fix that comes along.

deadalnix added inline comments.May 2 2022, 3:39 AM
llvm/test/CodeGen/X86/movmsk-cmp.ll
4566

Just saw that after submitting the patch. I'll update it to keep both.

deadalnix added inline comments.May 2 2022, 3:45 AM
llvm/test/CodeGen/X86/movmsk-cmp.ll
4566

Done, see D124756

RKSimon edited the summary of this revision. (Show Details)May 2 2022, 3:57 AM
deadalnix updated this revision to Diff 426478.May 2 2022, 12:06 PM

Rebase on top of D124756.

This ensures there is no regression for optimized IR, so this is workable - even though there are probably a couple of improvements that the backend could benefit from.

deadalnix added inline comments.May 2 2022, 12:08 PM
llvm/test/CodeGen/X86/movmsk-cmp.ll
4566

I rebased and was able to confirm that the regression doesn't show up on optimized IR, so I think we are good here.

RKSimon added inline comments.May 3 2022, 6:20 AM
llvm/test/CodeGen/ARM/swifterror.ll
687

This looks like another lost multiple load/store?

deadalnix added inline comments.May 4 2022, 11:04 AM
llvm/test/CodeGen/ARM/swifterror.ll
687

Yes, this looks similar to what is happening in arm64-abi-varargs.ll.

deadalnix added inline comments.May 6 2022, 5:44 AM
llvm/test/CodeGen/ARM/swifterror.ll
687

Investigating this a bit more, it looks like when CombineTo works in topological order we get some combines that are missed otherwise.

The DAG looks like this with topological CombineTo:

SelectionDAG has 58 nodes:
  t0: ch = EntryToken
  t5: i32,ch = CopyFromReg t0, Register:i32 %1
  t9: i32 = add FrameIndex:i32<-1>, Constant:i32<4>
  t11: i32,ch = CopyFromReg t0, Register:i32 %2
  t15: i32,ch = CopyFromReg t0, Register:i32 %3
  t19: i32,ch = CopyFromReg t0, Register:i32 %4
          t73: i32 = add FrameIndex:i32<-1>, Constant:i32<12>
        t68: ch = store<(store (s32) into unknown-address + 12)> t0, t19, t73, undef:i32
          t77: i32 = add FrameIndex:i32<-1>, Constant:i32<8>
        t74: ch = store<(store (s32) into unknown-address + 8)> t0, t15, t77, undef:i32
        t78: ch = store<(store (s32) into %fixed-stack.0 + 4)> t0, t11, t9, undef:i32
        t81: ch = store<(store (s32) into %fixed-stack.0)> t0, t5, FrameIndex:i32<-1>, undef:i32
      t83: ch = TokenFactor t19:1, t68, t15:1, t74, t11:1, t78, t5:1, t81
    t29: ch,glue = callseq_start t83, TargetConstant:i32<0>, TargetConstant:i32<0>
  t33: ch,glue = CopyToReg t29, Register:i32 $r0, Constant:i32<16>
  t35: ch,glue = CopyToReg t33, Register:i32 $r1, Constant:i32<0>, t33:1
  t38: ch,glue = ARMISD::CALL t35, TargetGlobalAddress:i32<i8* (i64)* @malloc> 0, Register:i32 $r0, Register:i32 $r1, RegisterMask:Untyped, t35:1
  t40: ch,glue = callseq_end t38, TargetConstant:i32<0>, TargetConstant:i32<-1>, t38:1
  t41: i32,ch,glue = CopyFromReg t40, Register:i32 $r0, t40:1
  t43: ch = CopyToReg t41:1, Register:i32 %5, t41
    t45: i32 = add nuw t41, Constant:i32<8>
  t85: ch = store<(store (s8) into %ir.tmp), trunc to i8> t43, Constant:i32<1>, t45, undef:i32
      t137: ch = store<(store (s32) into %ir.a12)> t43, t135, FrameIndex:i32<3>, undef:i32
      t128: ch = store<(store (s32) into %ir.args, align 8)> t43, t135:1, FrameIndex:i32<0>, undef:i32
      t140: ch = store<(store (s32) into %ir.a11)> t43, t122, FrameIndex:i32<2>, undef:i32
      t144: ch = store<(store (s32) into %ir.a10)> t43, t109, FrameIndex:i32<1>, undef:i32
    t147: ch = TokenFactor t137, t135:2, t128, t140, t122:2, t144, t109:1
  t64: ch,glue = CopyToReg t147, Register:i32 $r0, Constant:i32<1065353216>
  t66: ch,glue = CopyToReg t64, Register:i32 $r8, Register:i32 %5, t64:1
  t109: i32,ch = load<(load (s32))> t85, FrameIndex:i32<-1>, undef:i32
  t122: i32,i32,ch = load<(load (s32) from %fixed-stack.0 + 8), <post-inc>> t85, t9, Constant:i32<4>
  t135: i32,i32,ch = load<(load (s32)), <post-inc>> t85, t122:1, Constant:i32<4>
  t67: ch = ARMISD::RET_FLAG t66, Register:i32 $r0, Register:i32 $r8, t66:1

And without:

SelectionDAG has 58 nodes:
  t0: ch = EntryToken
  t5: i32,ch = CopyFromReg t0, Register:i32 %1
  t9: i32 = add FrameIndex:i32<-1>, Constant:i32<4>
  t11: i32,ch = CopyFromReg t0, Register:i32 %2
  t15: i32,ch = CopyFromReg t0, Register:i32 %3
  t19: i32,ch = CopyFromReg t0, Register:i32 %4
  t77: i32 = add FrameIndex:i32<-1>, Constant:i32<8>
  t73: i32 = add FrameIndex:i32<-1>, Constant:i32<12>
        t68: ch = store<(store (s32) into unknown-address + 12)> t0, t19, t73, undef:i32
        t74: ch = store<(store (s32) into unknown-address + 8)> t0, t15, t77, undef:i32
        t78: ch = store<(store (s32) into %fixed-stack.0 + 4)> t0, t11, t9, undef:i32
        t81: ch = store<(store (s32) into %fixed-stack.0)> t0, t5, FrameIndex:i32<-1>, undef:i32
      t83: ch = TokenFactor t19:1, t68, t15:1, t74, t11:1, t78, t5:1, t81
    t29: ch,glue = callseq_start t83, TargetConstant:i32<0>, TargetConstant:i32<0>
  t33: ch,glue = CopyToReg t29, Register:i32 $r0, Constant:i32<16>
  t35: ch,glue = CopyToReg t33, Register:i32 $r1, Constant:i32<0>, t33:1
  t38: ch,glue = ARMISD::CALL t35, TargetGlobalAddress:i32<i8* (i64)* @malloc> 0, Register:i32 $r0, Register:i32 $r1, RegisterMask:Untyped, t35:1
  t40: ch,glue = callseq_end t38, TargetConstant:i32<0>, TargetConstant:i32<-1>, t38:1
  t41: i32,ch,glue = CopyFromReg t40, Register:i32 $r0, t40:1
  t43: ch = CopyToReg t41:1, Register:i32 %5, t41
    t45: i32 = add nuw t41, Constant:i32<8>
  t85: ch = store<(store (s8) into %ir.tmp), trunc to i8> t43, Constant:i32<1>, t45, undef:i32
      t135: ch = store<(store (s32) into %ir.a12)> t43, t132, FrameIndex:i32<3>, undef:i32
      t127: ch = store<(store (s32) into %ir.args, align 8)> t43, t73, FrameIndex:i32<0>, undef:i32
      t138: ch = store<(store (s32) into %ir.a11)> t43, t120, FrameIndex:i32<2>, undef:i32
      t142: ch = store<(store (s32) into %ir.a10)> t43, t109, FrameIndex:i32<1>, undef:i32
    t145: ch = TokenFactor t135, t132:1, t127, t138, t120:1, t142, t109:1
  t64: ch,glue = CopyToReg t145, Register:i32 $r0, Constant:i32<1065353216>
  t66: ch,glue = CopyToReg t64, Register:i32 $r8, Register:i32 %5, t64:1
  t109: i32,ch = load<(load (s32))> t85, FrameIndex:i32<-1>, undef:i32
  t120: i32,ch = load<(load (s32))> t85, t9, undef:i32
  t132: i32,ch = load<(load (s32))> t85, t77, undef:i32
  t67: ch = ARMISD::RET_FLAG t66, Register:i32 $r0, Register:i32 $r8, t66:1

The meaningful change seems to be these crazy loads:

t122: i32,i32,ch = load<(load (s32) from %fixed-stack.0 + 8), <post-inc>> t85, t9, Constant:i32<4>
t135: i32,i32,ch = load<(load (s32)), <post-inc>> t85, t122:1, Constant:i32<4>

It'd be really helpful to have an ARM expert here to help us figure out what the right path forward is.

efriedma added inline comments.May 6 2022, 12:33 PM
llvm/test/CodeGen/ARM/swifterror.ll
687

We fail to form a post-inc load, which changes the register allocation, which blocks use of "ldm".

Post-inc load formation failing to trigger is a bit concerning; makes me think we somehow aren't running the relevant combine late enough. DAGCombiner::CombineToPreIndexedLoadStore is a target-independent combine with a bit of target-specific code; see getPreIndexedAddressParts().

"ldm" formation is unreliable on ARM because it runs after register allocation; I wouldn't worry about that part.

deadalnix added inline comments.May 6 2022, 5:16 PM
llvm/test/CodeGen/AArch64/arm64-abi-varargs.ll
23

Pre DAG:

SelectionDAG has 55 nodes:
  t0: ch = EntryToken
  t19: i32,ch = load<(load (s32) from %fixed-stack.1, align 16)> t0, FrameIndex:i64<-1>, undef:i64
    t2: i64,ch = CopyFromReg t0, Register:i64 %0
  t51: ch = store<(store (s32) into %ir.a1)> t19:1, t19, t2, undef:i64
  t98: i32,ch = load<(load (s32) from %fixed-stack.0, align 8)> t51, FrameIndex:i64<-2>, undef:i64
    t84: i64 = add FrameIndex:i64<-2>, Constant:i64<8>
  t109: i32,ch = load<(load (s32) from %fixed-stack.0 + 8, align 8)> t51, t84, undef:i64
    t108: i64 = add FrameIndex:i64<-2>, Constant:i64<16>
  t121: i32,ch = load<(load (s32) from %fixed-stack.0 + 16, align 8)> t51, t108, undef:i64
      t128: ch = store<(store (s32) into %ir.a12)> t0, t121, FrameIndex:i64<12>, undef:i64
        t120: i64 = add FrameIndex:i64<-2>, Constant:i64<24>
      t116: ch = store<(store (s64) into %ir.args)> t0, t120, FrameIndex:i64<9>, undef:i64
      t124: ch = store<(store (s32) into %ir.a11)> t0, t109, FrameIndex:i64<11>, undef:i64
      t131: ch = store<(store (s32) into %ir.a10)> t0, t98, FrameIndex:i64<10>, undef:i64
        t16: i32,ch = CopyFromReg t0, Register:i32 %7
      t53: ch = store<(store (s32) into %ir.8)> t0, t16, FrameIndex:i64<7>, undef:i64
        t14: i32,ch = CopyFromReg t0, Register:i32 %6
      t56: ch = store<(store (s32) into %ir.7)> t0, t14, FrameIndex:i64<6>, undef:i64
        t12: i32,ch = CopyFromReg t0, Register:i32 %5
      t59: ch = store<(store (s32) into %ir.6)> t0, t12, FrameIndex:i64<5>, undef:i64
        t10: i32,ch = CopyFromReg t0, Register:i32 %4
      t62: ch = store<(store (s32) into %ir.5)> t0, t10, FrameIndex:i64<4>, undef:i64
        t8: i32,ch = CopyFromReg t0, Register:i32 %3
      t65: ch = store<(store (s32) into %ir.4)> t0, t8, FrameIndex:i64<3>, undef:i64
        t6: i32,ch = CopyFromReg t0, Register:i32 %2
      t68: ch = store<(store (s32) into %ir.3)> t0, t6, FrameIndex:i64<2>, undef:i64
        t4: i32,ch = CopyFromReg t0, Register:i32 %1
      t71: ch = store<(store (s32) into %ir.2)> t0, t4, FrameIndex:i64<1>, undef:i64
    t134: ch = TokenFactor t128, t121:1, t116, t124, t109:1, t131, t98:1, t53, t56, t59, t62, t65, t68, t71
  t50: ch = AArch64ISD::RET_FLAG t134

Post DAG:

SelectionDAG has 51 nodes:
  t0: ch = EntryToken
  t19: i32,ch = load<(load (s32) from %fixed-stack.1, align 16)> t0, FrameIndex:i64<-1>, undef:i64
    t2: i64,ch = CopyFromReg t0, Register:i64 %0
  t51: ch = store<(store (s32) into %ir.a1)> t19:1, t19, t2, undef:i64
  t98: i32,ch = load<(load (s32) from %fixed-stack.0, align 8)> t51, FrameIndex:i64<-2>, undef:i64
    t84: i64 = add FrameIndex:i64<-2>, Constant:i64<8>
  t109: i32,i64,ch = load<(load (s32) from %fixed-stack.0 + 8, align 8), <post-inc>> t51, t84, Constant:i64<8>
  t120: i32,i64,ch = load<(load (s32)), <post-inc>> t51, t109:1, Constant:i64<8>
      t126: ch = store<(store (s32) into %ir.a12)> t0, t120, FrameIndex:i64<12>, undef:i64
      t115: ch = store<(store (s64) into %ir.args)> t0, t120:1, FrameIndex:i64<9>, undef:i64
      t122: ch = store<(store (s32) into %ir.a11)> t0, t109, FrameIndex:i64<11>, undef:i64
      t129: ch = store<(store (s32) into %ir.a10)> t0, t98, FrameIndex:i64<10>, undef:i64
        t16: i32,ch = CopyFromReg t0, Register:i32 %7
      t53: ch = store<(store (s32) into %ir.8)> t0, t16, FrameIndex:i64<7>, undef:i64
        t14: i32,ch = CopyFromReg t0, Register:i32 %6
      t56: ch = store<(store (s32) into %ir.7)> t0, t14, FrameIndex:i64<6>, undef:i64
        t12: i32,ch = CopyFromReg t0, Register:i32 %5
      t59: ch = store<(store (s32) into %ir.6)> t0, t12, FrameIndex:i64<5>, undef:i64
        t10: i32,ch = CopyFromReg t0, Register:i32 %4
      t62: ch = store<(store (s32) into %ir.5)> t0, t10, FrameIndex:i64<4>, undef:i64
        t8: i32,ch = CopyFromReg t0, Register:i32 %3
      t65: ch = store<(store (s32) into %ir.4)> t0, t8, FrameIndex:i64<3>, undef:i64
        t6: i32,ch = CopyFromReg t0, Register:i32 %2
      t68: ch = store<(store (s32) into %ir.3)> t0, t6, FrameIndex:i64<2>, undef:i64
        t4: i32,ch = CopyFromReg t0, Register:i32 %1
      t71: ch = store<(store (s32) into %ir.2)> t0, t4, FrameIndex:i64<1>, undef:i64
    t132: ch = TokenFactor t126, t120:2, t115, t122, t109:2, t129, t98:1, t53, t56, t59, t62, t65, t68, t71
  t50: ch = AArch64ISD::RET_FLAG t132

For this one, it is hard to argue that the DAG after this change to the combiner is worse; it definitely seems better to me.

llvm/test/CodeGen/AArch64/swifterror.ll
946

Pre DAG:

SelectionDAG has 41 nodes:
      t0: ch = EntryToken
    t6: ch,glue = callseq_start t0, TargetConstant:i64<0>, TargetConstant:i64<0>
  t10: ch,glue = CopyToReg t6, Register:i64 $x0, Constant:i64<16>
  t13: ch,glue = AArch64ISD::CALL t10, TargetGlobalAddress:i64<i8* (i64)* @malloc> 0, Register:i64 $x0, RegisterMask:Untyped, t10:1
  t14: ch,glue = callseq_end t13, TargetConstant:i64<0>, TargetConstant:i64<0>, t13:1
  t15: i64,ch,glue = CopyFromReg t14, Register:i64 $x0, t14:1
  t17: ch = CopyToReg t15:1, Register:i64 %1, t15
    t19: i64 = add nuw t15, Constant:i64<8>
  t45: ch = store<(store (s8) into %ir.tmp), trunc to i8> t17, Constant:i32<1>, t19, undef:i64
      t100: ch = store<(store (s32) into %ir.a12)> t17, t97, FrameIndex:i64<3>, undef:i64
        t95: i64 = add FrameIndex:i64<-1>, Constant:i64<24>
      t90: ch = store<(store (s64) into %ir.args)> t17, t95, FrameIndex:i64<0>, undef:i64
      t103: ch = store<(store (s32) into %ir.a11)> t17, t83, FrameIndex:i64<2>, undef:i64
      t107: ch = store<(store (s32) into %ir.a10)> t17, t71, FrameIndex:i64<1>, undef:i64
    t110: ch = TokenFactor t100, t97:1, t90, t103, t83:1, t107, t71:1
  t40: ch,glue = CopyToReg t110, Register:f32 $s0, ConstantFP:f32<1.000000e+00>
  t42: ch,glue = CopyToReg t40, Register:i64 $x21, Register:i64 %1, t40:1
  t71: i32,ch = load<(load (s32) from %fixed-stack.0, align 16)> t45, FrameIndex:i64<-1>, undef:i64
    t69: i64 = or FrameIndex:i64<-1>, Constant:i64<8>
  t83: i32,ch = load<(load (s32), align 8)> t45, t69, undef:i64
    t81: i64 = add FrameIndex:i64<-1>, Constant:i64<16>
  t97: i32,ch = load<(load (s32) from %fixed-stack.0 + 16, align 16)> t45, t81, undef:i64
  t43: ch = AArch64ISD::RET_FLAG t42, Register:f32 $s0, Register:i64 $x21, t42:1

Post DAG:

SelectionDAG has 38 nodes:
      t0: ch = EntryToken
    t6: ch,glue = callseq_start t0, TargetConstant:i64<0>, TargetConstant:i64<0>
  t10: ch,glue = CopyToReg t6, Register:i64 $x0, Constant:i64<16>
  t13: ch,glue = AArch64ISD::CALL t10, TargetGlobalAddress:i64<i8* (i64)* @malloc> 0, Register:i64 $x0, RegisterMask:Untyped, t10:1
  t14: ch,glue = callseq_end t13, TargetConstant:i64<0>, TargetConstant:i64<0>, t13:1
  t15: i64,ch,glue = CopyFromReg t14, Register:i64 $x0, t14:1
  t17: ch = CopyToReg t15:1, Register:i64 %1, t15
    t19: i64 = add nuw t15, Constant:i64<8>
  t45: ch = store<(store (s8) into %ir.tmp), trunc to i8> t17, Constant:i32<1>, t19, undef:i64
      t99: ch = store<(store (s32) into %ir.a12)> t17, t97, FrameIndex:i64<3>, undef:i64
      t90: ch = store<(store (s64) into %ir.args)> t17, t97:1, FrameIndex:i64<0>, undef:i64
      t102: ch = store<(store (s32) into %ir.a11)> t17, t84, FrameIndex:i64<2>, undef:i64
      t106: ch = store<(store (s32) into %ir.a10)> t17, t70, FrameIndex:i64<1>, undef:i64
    t109: ch = TokenFactor t99, t97:2, t90, t102, t84:2, t106, t70:1
  t40: ch,glue = CopyToReg t109, Register:f32 $s0, ConstantFP:f32<1.000000e+00>
  t42: ch,glue = CopyToReg t40, Register:i64 $x21, Register:i64 %1, t40:1
  t70: i32,ch = load<(load (s32) from %fixed-stack.0, align 16)> t45, FrameIndex:i64<-1>, undef:i64
    t73: i64 = or FrameIndex:i64<-1>, Constant:i64<8>
  t84: i32,i64,ch = load<(load (s32), align 8), <post-inc>> t45, t73, Constant:i64<8>
  t97: i32,i64,ch = load<(load (s32)), <post-inc>> t45, t84:1, Constant:i64<8>
  t43: ch = AArch64ISD::RET_FLAG t42, Register:f32 $s0, Register:i64 $x21, t42:1

Once again, this looks like it is improving things at the DAGCombine level, but trips up later stages in the backend.

llvm/test/CodeGen/ARM/swifterror.ll
687

I have to apologize because I now notice a major error in my previous message: I swapped the two DAGs. The post-inc loads are in this diff's version of the DAG, and they do not exist on main. This has typically been the pattern with these changes: they allow DAGCombine to do more, but sometimes later stages in the pipeline get confused. Nevertheless, if ldm is combined post register allocation, then it is indeed going to be fragile.

When you say that you wouldn't worry about this, do you mean that, because ldm is generally unreliable (as it happens post register allocation), we should proceed regardless of this specific change, assuming there is no other outstanding open question?

efriedma added inline comments.May 6 2022, 5:29 PM
llvm/test/CodeGen/ARM/swifterror.ll
687

Oh, we're forming more post-inc loads? Then sure, this test change is fine as-is.

If it causes issues in practice, we can mess with the post-inc formation heuristics.

efriedma added inline comments.May 6 2022, 5:33 PM
llvm/test/CodeGen/AArch64/arm64-abi-varargs.ll
23

It looks like post-inc formation is triggering more on AArch64 as well?

I think it's just something we're going to have to watch out for; if it causes practical issues, we might need to make post-inc formation less aggressive.

deadalnix added inline comments.May 6 2022, 5:40 PM
llvm/test/CodeGen/ARM/swifterror.ll
687

Yes, we are forming more of them. In a way, this is the whole point of this patch: traverse the nodes in a way that exposes more combining opportunities. But in this case, it looks like we either don't want to do the combine as aggressively, or we want to handle it better in later stages. The regressions in the ARM and AArch64 backends are variations on that same theme.

Now the question is what do we do from there? Do we land this and try to mess with the heuristics?

efriedma added inline comments.May 6 2022, 5:51 PM
llvm/test/CodeGen/ARM/swifterror.ll
687

I don't think the specific cases here really represent a general trend... but it's going to be hard to tell for sure without actually landing it, and seeing the fallout.

If the cases here do represent a trend, probably the first thing I'd look at is restricting the formation of post-inc memory ops that refer to an offset from a frame index.

I ran some quick benchmarks and didn't notice anything alarming. Almost nothing changed across the AArch64/Thumb2/Thumb1 tests I tried (at least in terms of codesize or performance). There can always be different code that might behave differently, but no objections from me.

deadalnix updated this revision to Diff 427859.May 7 2022, 6:40 AM

Fix aix32-cc-abi-vaarg.ll

Ok, this looks like it is good to go then. @RKSimon, if this is good for you, then we should land this.

RKSimon accepted this revision.May 7 2022, 8:47 AM

LGTM cheers

This revision is now accepted and ready to land.May 7 2022, 8:47 AM
fhahn added inline comments.May 9 2022, 1:58 AM
llvm/test/CodeGen/AArch64/swifterror.ll
946

It would be good to investigate what's blocking the load/store optimizer here.