This is an archive of the discontinued LLVM Phabricator instance.

Recommit [AArch64] Optimize memcmp when the result is tested for [in]equality with 0
ClosedPublic

Authored by Allen on Oct 19 2022, 4:02 AM.

Diff Detail

Event Timeline

Allen created this revision.Oct 19 2022, 4:02 AM
Herald added a project: Restricted Project. · View Herald TranscriptOct 19 2022, 4:02 AM
Allen requested review of this revision.Oct 19 2022, 4:02 AM
Herald added a project: Restricted Project. · View Herald TranscriptOct 19 2022, 4:02 AM
efriedma added inline comments.Oct 19 2022, 10:02 AM
llvm/test/CodeGen/AArch64/i128-cmp.ll
119–121

Is there some reason we don't want to combine this to cmp+ccmp+b.ne?

Allen added inline comments.Oct 21 2022, 2:04 AM
llvm/test/CodeGen/AArch64/i128-cmp.ll
119–121
  • Thanks for your attention. This case is block by the constraintN->use_begin()->getOpcode() != ISD::BRCOND, as I can't confirm that there is necessarily a benefit in this scenario. such as case test_rmw_add_128 in file CodeGen/AArch64/atomicrmw-O0.ll. If we can ignore the regression of O0, then I can relex this constraint ?
SelectionDAG has 19 nodes:
  t0: ch,glue = EntryToken
            t2: i64,ch = CopyFromReg t0, Register:i64 %0
            t6: i64,ch = CopyFromReg t0, Register:i64 %2
          t26: i64 = xor t2, t6
            t4: i64,ch = CopyFromReg t0, Register:i64 %1
            t8: i64,ch = CopyFromReg t0, Register:i64 %3
          t27: i64 = xor t4, t8
        t28: i64 = or t26, t27
      t22: i32 = setcc t28, Constant:i64<0>, setne:ch
    t21: ch = brcond t0, t22, BasicBlock:ch<exit 0xaaaab28f7268>
  t18: ch = br t21, BasicBlock:ch<call 0xaaaab28f7170>
  • This is the key change of case test_rmw_add_128 , which is compiled with -O0.
-; NOLSE-NEXT:    eor x11, x9, x11
-; NOLSE-NEXT:    eor x8, x10, x8
-; NOLSE-NEXT:    orr x8, x8, x11
+; NOLSE-NEXT:    mov x9, x8
 ; NOLSE-NEXT:    str x9, [sp, #8] // 8-byte Folded Spill
+; NOLSE-NEXT:    mov x10, x12
 ; NOLSE-NEXT:    str x10, [sp, #16] // 8-byte Folded Spill
+; NOLSE-NEXT:    subs x12, x12, x13
+; NOLSE-NEXT:    ccmp x8, x11, #0, eq
+; NOLSE-NEXT:    cset w8, eq
 ; NOLSE-NEXT:    str x10, [sp, #32] // 8-byte Folded Spill
 ; NOLSE-NEXT:    str x9, [sp, #40] // 8-byte Folded Spill
-; NOLSE-NEXT:    cbnz x8, .LBB4_1
+; NOLSE-NEXT:    tbnz w8, #0, .LBB4_1
efriedma added inline comments.Oct 21 2022, 10:35 AM
llvm/test/CodeGen/AArch64/i128-cmp.ll
119–121

We can mostly ignore codesize at -O0. (I mean, it matters to the extent that really bloated code can start to impact compile-time, but that isn't relevant here.)

Allen updated this revision to Diff 469832.Oct 21 2022, 5:32 PM
Allen marked an inline comment as done.

relax the constraint N->use_begin()->getOpcode() != ISD::BRCOND

llvm/test/CodeGen/AArch64/i128-cmp.ll
119–121

Done, Thank you for your guidance.

Could this be done during lowering, int AArch64TargetLowering::LowerSETCC, or does that not work?
The getNZCVToSatisfyCondCode method is useful for getting the constant needed for CCMP's.

llvm/test/CodeGen/AArch64/i128-cmp.ll
13

Are you sure this is correct? It doesn't look right.

I think I would expect ccmp #0, eq; cset eq.

23–24

And here it needs to set based on ne, so maybe ccmp #8, eq; cmp ne.

Those two verify as equivalent.

Allen added inline comments.Oct 22 2022, 7:22 AM
llvm/test/CodeGen/AArch64/i128-cmp.ll
13

Yes, I test the executive for the initial case in https://github.com/llvm/llvm-project/issues/58061

Allen marked 3 inline comments as done.Oct 22 2022, 7:38 AM
Allen added inline comments.
llvm/test/CodeGen/AArch64/i128-cmp.ll
23–24
Allen marked an inline comment as done.Oct 24 2022, 1:49 AM

Could this be done during lowering, int AArch64TargetLowering::LowerSETCC, or does that not work?
The getNZCVToSatisfyCondCode method is useful for getting the constant needed for CCMP's.

Thanks for your suggestion, I try to debug the function br_on_cmp_i128_eq in file CodeGen/AArch64/i128-cmp.ll, and find that the setcc is transform into br_cc in AArch64TargetLowering::LowerOperation
so I think it can also work in AArch64TargetLowering. Out of intresting, I'd like to know why you recommend processing in the AArch64TargetLowering?

Could this be done during lowering, int AArch64TargetLowering::LowerSETCC, or does that not work?
The getNZCVToSatisfyCondCode method is useful for getting the constant needed for CCMP's.

Thanks for your suggestion, I try to debug the function br_on_cmp_i128_eq in file CodeGen/AArch64/i128-cmp.ll, and find that the setcc is transform into br_cc in AArch64TargetLowering::LowerOperation
so I think it can also work in AArch64TargetLowering. Out of intresting, I'd like to know why you recommend processing in the AArch64TargetLowering?

I see. That sounds like a good reason not to do it in Lowering.

llvm/test/CodeGen/AArch64/i128-cmp.ll
13

How did you test that? Is it the code from https://gcc.godbolt.org/z/Tv1YP6bPc? Could it have been constant-folded away?

Allen added inline comments.Oct 24 2022, 5:56 AM
llvm/test/CodeGen/AArch64/i128-cmp.ll
13

Yes, I test the executive very simple, just run the following cmd with and without the changes.
~/llvm-project-upstream/build/bin/clang -march=armv8.2-a -O3 run.c -ffast-math;./a.out

efriedma added inline comments.Oct 24 2022, 11:07 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
19495

Leftover comment about brcond

dmgreen added inline comments.Oct 24 2022, 11:33 AM
llvm/test/CodeGen/AArch64/i128-cmp.ll
13

Have you tried with -fno-inline? The example godbolt link has everything inlined into main, and constant folded into the arguments of the printf's. It doesn't look like it is really testing the codegen.
I would expect the code to be the same as this for eq, due to the way i128 is split into i64 registers: https://godbolt.org/z/aed4bYn6n

Allen marked an inline comment as done.Oct 25 2022, 12:31 AM
Allen added inline comments.
llvm/test/CodeGen/AArch64/i128-cmp.ll
13

oh, thanks very much, you are right.

But how can I get the #8 for case cmp_i128_ne ? it seem the input Code should be MI or LT when it return expect "#8" with function getNZCVToSatisfyCondCode ?

dmgreen added inline comments.Oct 25 2022, 12:49 AM
llvm/test/CodeGen/AArch64/i128-cmp.ll
13

Yeah, #8 is not the only choice. I think it can be any constant where the the Z bit is clear, so #0 is fine (as is #8). #4 is not because that is the Z bit.

Because the eq is a && and the ne is a ||, it might be simpler to just use a constant of 0 in both cases, providing that works.

Allen updated this revision to Diff 470406.Oct 25 2022, 1:24 AM
Allen marked an inline comment as done.

Add AArch64CC::getInvertedCondCode(CC) to fix the runtime issue

Allen updated this revision to Diff 470407.Oct 25 2022, 1:33 AM
Allen marked 3 inline comments as done.

update tests

Allen added inline comments.Oct 25 2022, 1:34 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
19495

Done

llvm/test/CodeGen/AArch64/i128-cmp.ll
13

Apply your comment, thanks for guidance

dmgreen added inline comments.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
19503

0 -> AArch64CC::EQ.

19506

I'm not sure if this should be MVT::Glue or MVT::i32. It seems to be created differently in different places.

19509–19510

This comment doesn't seem very helpful as a code-comment.

19511

AArch64CC::EQ -> 0 is probably better. It is not a condition, but the value the NZCV flags are set to.

Allen updated this revision to Diff 470433.Oct 25 2022, 3:53 AM
Allen marked an inline comment as done.

address comment

Allen marked 3 inline comments as done.Oct 25 2022, 3:54 AM
Allen added inline comments.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
19506

I don't very sure this is the accurate answer, it seems the MVT::Glue implicit instructions are scheduled together?
https://lists.llvm.org/pipermail/llvm-dev/2014-June/074046.html

19511

Apply your comment, thanks

bcl5980 added inline comments.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
19499–19500

LHS should be OneUse also?

19506

You haven't pass the glue to other instructions so the glue is useless. And I think we needn't use Glue here.

19510

I am not sure if we can just combine to ISD::SETCC ? Maybe it can combine with some other op.

Allen marked an inline comment as done.Oct 25 2022, 8:35 AM

need rebase as fail in case llvm/test/CodeGen/AArch64/bcmp.ll, which is new precommited in e95c74b423c

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
19499–19500

The LHS node itself is not used in the return value when the pattern matched, so I don't think the OneUse is needed, correct me if I'm wrong, thanks.

19506

Thanks, I'll updated it.

19510

sorry, I don't understand what is the ISD::SETCC, could you please show more detailedly? as I don't find it in my changes.

bcl5980 added inline comments.Oct 25 2022, 8:03 PM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
19499–19500

for example:

int use(int);
int f(int a, int b, int c, int d)
{
   int xor0 = a ^ b;
   int xor1 = c ^ d;
   int or0   = xor0 | xor1;
   if (or0 != 0)
        return use(or0);
   return a;
}

or0 is not one use. So we should keep all of the xor+or patterns.

19510

The code should be simpler by combine to SetCC:

SDValue XOR0 = LHS.getOperand(0);
SDValue XOR1 = LHS.getOperand(1);
SDValue Cmp0 = DAG.getSetCC(DL, VT, XOR0.getOperand(0), XOR0.getOperand(1),
                            ISD::SETNE);
SDValue Cmp1 = DAG.getSetCC(DL, VT, XOR1.getOperand(0), XOR1.getOperand(1),
                            ISD::SETNE);
SDValue Cmp = DAG.getNode(ISD::OR, DL, VT, Cmp0, Cmp1);
return DAG.getSetCC(DL, VT, Cmp, DAG.getConstant(0, DL, VT), Cond);

But may fall into potential dead loop if somewhere has the reverse combination.
@dmgreen , which way do you think is better?

Allen added inline comments.Oct 26 2022, 12:49 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
19499–19500
  • thanks for your case. If the or0 used more than one, then the xor+or patterns will be keep as we delete then when we match the pattern
  • base your above case, it seems works, https://alive2.llvm.org/ce/z/699vcf Of course, this match doesn't depend on that either, and I can add oneuse of LHS if you still worry about it.
bcl5980 added inline comments.Oct 26 2022, 1:36 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
19499–19500

It looks your case source instructions is less than dest?

Allen marked an inline comment as done.Oct 26 2022, 2:25 AM
Allen added inline comments.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
19499–19500

I'm just suggesting the multi-use is allowed base alive2
as I don't know how express the ccmp instruction, this is not accurate.
ok, I'll add the oneuse of LHS, as it is still not fully agreed, thanks very much.

dmgreen added inline comments.Oct 26 2022, 6:53 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
19499–19500

I see. Can you add the testcase where the LHS has multiple uses:

define i32 @multiuse(i32 %0, i32 %1, i32 %2, i32 %3) {
  %5 = xor i32 %1, %0
  %6 = xor i32 %3, %2
  %7 = or i32 %6, %5
  %8 = icmp eq i32 %7, 0
  br i1 %8, label %11, label %9

9:                                                ; preds = %4
  %10 = tail call i32 @use(i32 %7) #2
  br label %11

11:                                               ; preds = %4, %9
  %12 = phi i32 [ %10, %9 ], [ %0, %4 ]
  ret i32 %12
}

We just need to make sure it doesn't increase the number of instructions.

19506

Glue is probably OK. Can you add this test case:

define i32 @eq(i128 noundef %x, i128 noundef %y) {
entry:
  %cmp3 = icmp eq i128 %x, %y
  %conv = trunc i128 %x to i64
  %conv1 = trunc i128 %y to i64
  %cmp = icmp eq i64 %conv, %conv1
  %or7 = or i1 %cmp3, %cmp
  %or = zext i1 %or7 to i32
  ret i32 %or
}

There may be issues with the CMP/CCMP with the scheduling of instructions that ISel will create out of the DAG, but I've not seen any happen yet.

19510

Hmm. It is not worth it if we are taking two steps to do what we could do in one. But there could be further DAG combiners for the setcc. I'd say this method is fine so long as we don't do it too early. If we find cases where there are missing combines, we can always add extra folds for the CMP/CCMPs.

Allen updated this revision to Diff 470991.Oct 26 2022, 7:18 PM
Allen marked an inline comment as done.

Add LHS.hasOneUse() and 2 more cases

dmgreen accepted this revision.Oct 27 2022, 11:50 AM

Thanks. LGTM

This revision is now accepted and ready to land.Oct 27 2022, 11:50 AM

hi, I think this patch is triggering an assert failure in SelectionDAG.

I've filed https://github.com/llvm/llvm-project/issues/58675 to track it, with the reproducer. I've bisected the issue to this commit, and reverting locally stops the assert from triggering on the reproducer.

Can you take a look a the issue, and revert this until a fix is available?

This revision is now accepted and ready to land.Oct 28 2022, 4:28 PM
Allen marked an inline comment as done.Oct 29 2022, 4:18 AM
Allen added inline comments.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
19506

As the crash of https://github.com/llvm/llvm-project/issues/58675, we need update the MVT::glue into MVT::i32

Allen updated this revision to Diff 471745.Oct 29 2022, 4:45 AM
Allen marked an inline comment as done.
Allen retitled this revision from [AArch64] Optimize memcmp when the result is tested for [in]equality with 0 to Recommit [AArch64] Optimize memcmp when the result is tested for [in]equality with 0.
Allen edited the summary of this revision. (Show Details)