This is an archive of the discontinued LLVM Phabricator instance.

Paths

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
llvm/test/CodeGen/AArch64/atomicrmw-O0.ll
llvm/test/CodeGen/AArch64/bcmp-inline-small.ll
llvm/test/CodeGen/AArch64/bcmp.ll
llvm/test/CodeGen/AArch64/dag-combine-setcc.ll
llvm/test/CodeGen/AArch64/i128-cmp.ll
llvm/test/CodeGen/AArch64/umulo-128-legalisation-lowering.ll

Differential D136244

Recommit [AArch64] Optimize memcmp when the result is tested for [in]equality with 0
ClosedPublic

Authored by Allen on Oct 19 2022, 4:02 AM.

Details

Reviewers

spatel
paulwalker-arm
efriedma
dmgreen

Commits

rG63a46385f2c6: Recommit [AArch64] Optimize memcmp when the result is tested for [in]equality…
rG01ff511593d1: [AArch64] Optimize memcmp when the result is tested for [in]equality with 0

Summary

Fixes 1st issue of https://github.com/llvm/llvm-project/issues/58061
Fixes the crash of https://github.com/llvm/llvm-project/issues/58675

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

Allen created this revision.Oct 19 2022, 4:02 AM

Herald added a project: Restricted Project.Oct 19 2022, 4:02 AM

Herald added subscribers: hiraditya, kristof.beyls.

Allen requested review of this revision.Oct 19 2022, 4:02 AM

Herald added a subscriber: llvm-commits.Oct 19 2022, 4:02 AM

Dushistov added a subscriber: Dushistov.Oct 19 2022, 4:55 AM

Harbormaster completed remote builds in B192967: Diff 468863.Oct 19 2022, 5:16 AM

efriedma added inline comments.Oct 19 2022, 10:02 AM

llvm/test/CodeGen/AArch64/i128-cmp.ll
119–121	Is there some reason we don't want to combine this to cmp+ccmp+b.ne?

Allen added inline comments.Oct 21 2022, 2:04 AM

llvm/test/CodeGen/AArch64/i128-cmp.ll

119–121

Thanks for your attention. This case is blocked by the constraint N->use_begin()->getOpcode() != ISD::BRCOND, because I can't confirm that the transform is necessarily a benefit in this scenario; see, for example, the case test_rmw_add_128 in CodeGen/AArch64/atomicrmw-O0.ll. If we can ignore the regression at -O0, can I relax this constraint?

SelectionDAG has 19 nodes:
  t0: ch,glue = EntryToken
            t2: i64,ch = CopyFromReg t0, Register:i64 %0
            t6: i64,ch = CopyFromReg t0, Register:i64 %2
          t26: i64 = xor t2, t6
            t4: i64,ch = CopyFromReg t0, Register:i64 %1
            t8: i64,ch = CopyFromReg t0, Register:i64 %3
          t27: i64 = xor t4, t8
        t28: i64 = or t26, t27
      t22: i32 = setcc t28, Constant:i64<0>, setne:ch
    t21: ch = brcond t0, t22, BasicBlock:ch<exit 0xaaaab28f7268>
  t18: ch = br t21, BasicBlock:ch<call 0xaaaab28f7170>

This is the key change in the test_rmw_add_128 case, which is compiled with -O0.

-; NOLSE-NEXT:    eor x11, x9, x11
-; NOLSE-NEXT:    eor x8, x10, x8
-; NOLSE-NEXT:    orr x8, x8, x11
+; NOLSE-NEXT:    mov x9, x8
 ; NOLSE-NEXT:    str x9, [sp, #8] // 8-byte Folded Spill
+; NOLSE-NEXT:    mov x10, x12
 ; NOLSE-NEXT:    str x10, [sp, #16] // 8-byte Folded Spill
+; NOLSE-NEXT:    subs x12, x12, x13
+; NOLSE-NEXT:    ccmp x8, x11, #0, eq
+; NOLSE-NEXT:    cset w8, eq
 ; NOLSE-NEXT:    str x10, [sp, #32] // 8-byte Folded Spill
 ; NOLSE-NEXT:    str x9, [sp, #40] // 8-byte Folded Spill
-; NOLSE-NEXT:    cbnz x8, .LBB4_1
+; NOLSE-NEXT:    tbnz w8, #0, .LBB4_1

efriedma added inline comments.Oct 21 2022, 10:35 AM

llvm/test/CodeGen/AArch64/i128-cmp.ll
119–121	We can mostly ignore codesize at -O0. (I mean, it matters to the extent that really bloated code can start to impact compile-time, but that isn't relevant here.)

relax the constraint N->use_begin()->getOpcode() != ISD::BRCOND

llvm/test/CodeGen/AArch64/i128-cmp.ll
119–121	Done, Thank you for your guidance.

Harbormaster completed remote builds in B193692: Diff 469832.Oct 21 2022, 6:38 PM

Could this be done during lowering, in AArch64TargetLowering::LowerSETCC, or does that not work?
The getNZCVToSatisfyCondCode method is useful for getting the constant needed for CCMP's.

llvm/test/CodeGen/AArch64/i128-cmp.ll
13	Are you sure this is correct? It doesn't look right. I think I would expect `ccmp #0, eq; cset eq`.
23–24	And here it needs to set based on ne, so maybe `ccmp #8, eq; cmp ne`. Those two verify as equivalent.

Allen added inline comments.Oct 22 2022, 7:22 AM

llvm/test/CodeGen/AArch64/i128-cmp.ll
13	Yes, I tested the executable for the initial case in https://github.com/llvm/llvm-project/issues/58061

Allen marked 3 inline comments as done.Oct 22 2022, 7:38 AM

Allen added inline comments.

llvm/test/CodeGen/AArch64/i128-cmp.ll
23–24	The cases can also be found in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104611

In D136244#3876986, @dmgreen wrote:

Could this be done during lowering, in AArch64TargetLowering::LowerSETCC, or does that not work?
The getNZCVToSatisfyCondCode method is useful for getting the constant needed for CCMP's.

Thanks for your suggestion. I tried debugging the function br_on_cmp_i128_eq in CodeGen/AArch64/i128-cmp.ll and found that the setcc is transformed into br_cc in AArch64TargetLowering::LowerOperation,
so I think it can also work in AArch64TargetLowering. Out of interest, I'd like to know why you recommend handling this in AArch64TargetLowering?

In D136244#3878647, @Allen wrote:

In D136244#3876986, @dmgreen wrote:

Could this be done during lowering, in AArch64TargetLowering::LowerSETCC, or does that not work?
The getNZCVToSatisfyCondCode method is useful for getting the constant needed for CCMP's.

Thanks for your suggestion. I tried debugging the function br_on_cmp_i128_eq in CodeGen/AArch64/i128-cmp.ll and found that the setcc is transformed into br_cc in AArch64TargetLowering::LowerOperation,
so I think it can also work in AArch64TargetLowering. Out of interest, I'd like to know why you recommend handling this in AArch64TargetLowering?

I see. That sounds like a good reason not to do it in Lowering.

llvm/test/CodeGen/AArch64/i128-cmp.ll
13	How did you test that? Is it the code from https://gcc.godbolt.org/z/Tv1YP6bPc? Could it have been constant-folded away?

Allen added inline comments.Oct 24 2022, 5:56 AM

llvm/test/CodeGen/AArch64/i128-cmp.ll
13	Yes, I tested the executable very simply: I just ran the following command with and without the changes. ~/llvm-project-upstream/build/bin/clang -march=armv8.2-a -O3 run.c -ffast-math;./a.out

efriedma added inline comments.Oct 24 2022, 11:07 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
19495	Leftover comment about brcond

dmgreen added inline comments.Oct 24 2022, 11:33 AM

llvm/test/CodeGen/AArch64/i128-cmp.ll
13	Have you tried with -fno-inline? The example godbolt link has everything inlined into main, and constant folded into the arguments of the printf's. It doesn't look like it is really testing the codegen. I would expect the code to be the same as this for eq, due to the way i128 is split into i64 registers: https://godbolt.org/z/aed4bYn6n
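
A minimal sketch of a harness that keeps the comparison from being folded into constants, as an alternative to passing -fno-inline. Everything named here is an assumption for illustration: the `U128` struct stands in for an i128 split into two 64-bit halves, `cmp_i128_eq`, `cmp_i128_ne`, `opaque` and `run_checks` are hypothetical helpers, and `__attribute__((noinline))` is the GCC/Clang spelling of the attribute.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for an i128 value split into two 64-bit halves.
struct U128 {
  std::uint64_t lo;
  std::uint64_t hi;
};

// noinline keeps each comparison in its own function, so the optimizer
// cannot fold it into the caller's constants.
__attribute__((noinline)) bool cmp_i128_eq(U128 a, U128 b) {
  return a.lo == b.lo && a.hi == b.hi;
}

__attribute__((noinline)) bool cmp_i128_ne(U128 a, U128 b) {
  return a.lo != b.lo || a.hi != b.hi;
}

// Routes the inputs through volatile objects so their values are not
// visible to the optimizer at compile time.
U128 opaque(std::uint64_t lo, std::uint64_t hi) {
  volatile std::uint64_t vlo = lo;
  volatile std::uint64_t vhi = hi;
  return U128{vlo, vhi};
}

// Exercises equal values and values that differ in only one half.
void run_checks() {
  U128 x = opaque(1, 2);
  assert(cmp_i128_eq(x, opaque(1, 2)));
  assert(!cmp_i128_ne(x, opaque(1, 2)));
  assert(cmp_i128_ne(x, opaque(1, 3)));  // high halves differ
  assert(cmp_i128_ne(x, opaque(0, 2)));  // low halves differ
  assert(!cmp_i128_eq(x, opaque(0, 2)));
}
```

Calling `run_checks()` from `main` and comparing the behaviour with and without the patch exercises the generated compare sequence itself rather than a folded constant.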

Allen marked an inline comment as done.Oct 25 2022, 12:31 AM

Allen added inline comments.

llvm/test/CodeGen/AArch64/i128-cmp.ll
13	Oh, thanks very much, you are right. But how can I get the #8 for the cmp_i128_ne case? It seems the input Code would have to be MI or LT for the function getNZCVToSatisfyCondCode to return the expected "#8"?

dmgreen added inline comments.Oct 25 2022, 12:49 AM

llvm/test/CodeGen/AArch64/i128-cmp.ll
13	Yeah, #8 is not the only choice. I think it can be any constant where the Z bit is clear, so #0 is fine (as is #8). #4 is not because that is the Z bit. Because the eq is a && and the ne is a ||, it might be simpler to just use a constant of 0 in both cases, providing that works.
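
The point about the immediate can be checked with a small software model of the flag behaviour. This is only a sketch under stated assumptions: `Flags`, `subs`, `ccmp_eq`, `z_after`, `imm_is_valid` and `check_known_immediates` are hypothetical names, the model covers only the `eq` form of CCMP, and the sample inputs are a handful of 64-bit values rather than an exhaustive proof.

```cpp
#include <cassert>
#include <cstdint>

// Software model of the AArch64 NZCV condition flags.
struct Flags {
  bool n, z, c, v;
};

// Flags produced by a 64-bit SUBS/CMP computing a - b.
Flags subs(std::uint64_t a, std::uint64_t b) {
  std::uint64_t r = a - b;
  bool n = ((r >> 63) & 1) != 0;
  bool z = r == 0;
  bool c = a >= b;                                  // carry set: no borrow
  bool v = ((((a ^ b) & (a ^ r)) >> 63) & 1) != 0;  // signed overflow
  return Flags{n, z, c, v};
}

// CCMP a, b, #imm, eq: if Z is set on entry, compare a with b; otherwise
// load the flags from the 4-bit immediate (N=8, Z=4, C=2, V=1).
Flags ccmp_eq(Flags in, std::uint64_t a, std::uint64_t b, unsigned imm) {
  if (in.z)
    return subs(a, b);
  return Flags{(imm & 8u) != 0, (imm & 4u) != 0, (imm & 2u) != 0,
               (imm & 1u) != 0};
}

// Z flag after "subs a0, a1; ccmp b0, b1, #imm, eq".
bool z_after(std::uint64_t a0, std::uint64_t a1, std::uint64_t b0,
             std::uint64_t b1, unsigned imm) {
  return ccmp_eq(subs(a0, a1), b0, b1, imm).z;
}

// True when, for every sample input, Z ends up set exactly when both
// pairs are equal, i.e. when the 128-bit values compare equal.
bool imm_is_valid(unsigned imm) {
  const std::uint64_t samples[] = {0, 1, 2, UINT64_MAX,
                                   std::uint64_t{1} << 63};
  for (std::uint64_t a0 : samples)
    for (std::uint64_t a1 : samples)
      for (std::uint64_t b0 : samples)
        for (std::uint64_t b1 : samples) {
          bool expect = (a0 == a1) && (b0 == b1);
          if (z_after(a0, a1, b0, b1, imm) != expect)
            return false;
        }
  return true;
}

// The three immediates discussed in the review.
void check_known_immediates() {
  assert(imm_is_valid(0));   // Z clear
  assert(imm_is_valid(8));   // only N set, Z clear
  assert(!imm_is_valid(4));  // Z set: a mismatch in the first pair is lost
}
```

In this model an immediate works exactly when its Z bit (value 4) is clear, which matches the reasoning for using 0 in both the eq and ne cases and reading Z with the appropriate condition afterwards.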

Add AArch64CC::getInvertedCondCode(CC) to fix the runtime issue

update tests

Allen added inline comments.Oct 25 2022, 1:34 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
19495	Done
llvm/test/CodeGen/AArch64/i128-cmp.ll
13	Apply your comment, thanks for guidance

dmgreen mentioned this in D136672: [ExpandMemCmp][AArch64] Add a new option PreferCmpToExpand in inMemCmpExpansionOptions and enable on AArch64.Oct 25 2022, 3:18 AM

dmgreen added inline comments.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
19503	0 -> AArch64CC::EQ.
19506	I'm not sure if this should be MVT::Glue or MVT::i32. It seems to be created differently in different places.
19509–19510	This comment doesn't seem very helpful as a code-comment.
19511	AArch64CC::EQ -> 0 is probably better. It is not a condition, but the value the NZCV flags are set to.

address comment

Allen marked 3 inline comments as done.Oct 25 2022, 3:54 AM

Allen added inline comments.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
19506	I'm not sure this is the accurate answer, but it seems that instructions linked by MVT::Glue are scheduled together? https://lists.llvm.org/pipermail/llvm-dev/2014-June/074046.html
19511	Apply your comment, thanks

Harbormaster completed remote builds in B194136: Diff 470433.Oct 25 2022, 5:33 AM

bcl5980 added a subscriber: bcl5980.Oct 25 2022, 6:57 AM

bcl5980 added inline comments.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
19499–19500	Should LHS also be OneUse?
19506	You haven't passed the glue to other instructions, so the glue is useless. I think we needn't use Glue here.
19510	I am not sure if we can just combine to ISD::SETCC? Maybe it can then combine with some other op.

Needs a rebase, as it fails in llvm/test/CodeGen/AArch64/bcmp.ll, which was newly precommitted in e95c74b423c

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
19499–19500	The LHS node itself is not used in the return value when the pattern matches, so I don't think OneUse is needed. Correct me if I'm wrong, thanks.
19506	Thanks, I'll update it.
19510	Sorry, I don't understand what ISD::SETCC refers to. Could you explain in more detail? I can't find it in my changes.

bcl5980 added inline comments.Oct 25 2022, 8:03 PM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp

19499–19500

for example:

int use(int);
int f(int a, int b, int c, int d)
{
   int xor0 = a ^ b;
   int xor1 = c ^ d;
   int or0   = xor0 | xor1;
   if (or0 != 0)
        return use(or0);
   return a;
}

or0 has more than one use, so we should keep all of the xor+or patterns.

19510

The code should be simpler if we combine to SetCC:

SDValue XOR0 = LHS.getOperand(0);
SDValue XOR1 = LHS.getOperand(1);
SDValue Cmp0 = DAG.getSetCC(DL, VT, XOR0.getOperand(0), XOR0.getOperand(1),
                            ISD::SETNE);
SDValue Cmp1 = DAG.getSetCC(DL, VT, XOR1.getOperand(0), XOR1.getOperand(1),
                            ISD::SETNE);
SDValue Cmp = DAG.getNode(ISD::OR, DL, VT, Cmp0, Cmp1);
return DAG.getSetCC(DL, VT, Cmp, DAG.getConstant(0, DL, VT), Cond);

But this may fall into an infinite loop if the reverse combine exists somewhere.
@dmgreen , which way do you think is better?
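
The equivalence behind the SetCC suggestion can be illustrated outside the DAG. The sketch below is an assumption-laden model, not LLVM code: `via_xor_or`, `via_setcc`, `count_mismatches` and `check_examples` are hypothetical names, setcc results are modelled as 0 or 1, and the check is exhaustive only over 3-bit inputs. It says nothing about the combine-loop concern raised above.

```cpp
#include <cassert>
#include <cstdint>
#include <initializer_list>

// Original pattern: setcc (or (xor a, b), (xor c, d)), 0, eq/ne.
bool via_xor_or(std::uint32_t a, std::uint32_t b, std::uint32_t c,
                std::uint32_t d, bool is_ne) {
  std::uint32_t v = (a ^ b) | (c ^ d);
  return is_ne ? v != 0 : v == 0;
}

// Suggested form: setcc (or (setne a, b), (setne c, d)), 0, eq/ne.
bool via_setcc(std::uint32_t a, std::uint32_t b, std::uint32_t c,
               std::uint32_t d, bool is_ne) {
  std::uint32_t cmp0 = (a != b) ? 1u : 0u;
  std::uint32_t cmp1 = (c != d) ? 1u : 0u;
  std::uint32_t v = cmp0 | cmp1;
  return is_ne ? v != 0 : v == 0;
}

// Counts inputs on which the two forms disagree (all 3-bit operands,
// both condition codes).
int count_mismatches() {
  int mismatches = 0;
  for (std::uint32_t a = 0; a < 8; ++a)
    for (std::uint32_t b = 0; b < 8; ++b)
      for (std::uint32_t c = 0; c < 8; ++c)
        for (std::uint32_t d = 0; d < 8; ++d)
          for (bool is_ne : {false, true})
            if (via_xor_or(a, b, c, d, is_ne) != via_setcc(a, b, c, d, is_ne))
              ++mismatches;
  return mismatches;
}

// A few hand-picked spot checks.
void check_examples() {
  assert(via_setcc(1, 1, 2, 2, /*is_ne=*/false));   // both pairs equal
  assert(via_setcc(1, 1, 2, 3, /*is_ne=*/true));    // second pair differs
  assert(!via_xor_or(1, 0, 2, 2, /*is_ne=*/false)); // first pair differs
}
```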

Allen added inline comments.Oct 26 2022, 12:49 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
19499–19500	Thanks for your case. If or0 is used more than once, the xor+or pattern will be kept alongside the new nodes when we match the pattern. Based on your case above, it seems to work: https://alive2.llvm.org/ce/z/699vcf. Of course, this match doesn't depend on that either, and I can add a one-use check on LHS if you are still worried about it.

bcl5980 added inline comments.Oct 26 2022, 1:36 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
19499–19500	It looks like your case has fewer source instructions than destination instructions?

Allen marked an inline comment as done.Oct 26 2022, 2:25 AM

Allen added inline comments.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
19499–19500	I was only suggesting, based on alive2, that multi-use is allowed; since I don't know how to express the ccmp instruction there, this is not accurate. OK, I'll add the one-use check on LHS, as this is still not fully agreed. Thanks very much.

dmgreen added inline comments.Oct 26 2022, 6:53 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
19499–19500

I see. Can you add the testcase where the LHS has multiple uses:

define i32 @multiuse(i32 %0, i32 %1, i32 %2, i32 %3) {
  %5 = xor i32 %1, %0
  %6 = xor i32 %3, %2
  %7 = or i32 %6, %5
  %8 = icmp eq i32 %7, 0
  br i1 %8, label %11, label %9

9:                                                ; preds = %4
  %10 = tail call i32 @use(i32 %7) #2
  br label %11

11:                                               ; preds = %4, %9
  %12 = phi i32 [ %10, %9 ], [ %0, %4 ]
  ret i32 %12
}

We just need to make sure it doesn't increase the number of instructions.

19506

Glue is probably OK. Can you add this test case:

define i32 @eq(i128 noundef %x, i128 noundef %y) {
entry:
  %cmp3 = icmp eq i128 %x, %y
  %conv = trunc i128 %x to i64
  %conv1 = trunc i128 %y to i64
  %cmp = icmp eq i64 %conv, %conv1
  %or7 = or i1 %cmp3, %cmp
  %or = zext i1 %or7 to i32
  ret i32 %or
}

There may be issues with the CMP/CCMP with the scheduling of instructions that ISel will create out of the DAG, but I've not seen any happen yet.

19510	Hmm. It is not worth it if we are taking two steps to do what we could do in one. But there could be further DAG combiners for the setcc. I'd say this method is fine so long as we don't do it too early. If we find cases where there are missing combines, we can always add extra folds for the CMP/CCMPs.

Add LHS.hasOneUse() and 2 more cases

Harbormaster completed remote builds in B194543: Diff 470991.Oct 26 2022, 7:58 PM

Thanks. LGTM

This revision is now accepted and ready to land.Oct 27 2022, 11:50 AM

Closed by commit rG01ff511593d1: [AArch64] Optimize memcmp when the result is tested for [in]equality with 0 (authored by Allen).Oct 27 2022, 4:57 PM

This revision was automatically updated to reflect the committed changes.

Allen added a commit: rG01ff511593d1: [AArch64] Optimize memcmp when the result is tested for [in]equality with 0.

hi, I think this patch is triggering an assert failure in SelectionDAG.

I've filed https://github.com/llvm/llvm-project/issues/58675 to track it, with the reproducer. I've bisected the issue to this commit, and reverting locally stops the assert from triggering on the reproducer.

Can you take a look at the issue, and revert this until a fix is available?

paulkirth added a reverting change: rG1c0681757669: Revert "[AArch64] Optimize memcmp when the result is tested for [in]equality….Oct 28 2022, 4:25 PM

I've reverted this for now in rG1c0681757669880bda144aeb56dcad6901a2016b

This revision is now accepted and ready to land.Oct 28 2022, 4:28 PM

Allen marked an inline comment as done.Oct 29 2022, 4:18 AM

Allen added inline comments.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
19506	Because of the crash in https://github.com/llvm/llvm-project/issues/58675, we need to change MVT::Glue to MVT::i32

Fix the crash of https://github.com/llvm/llvm-project/issues/58675

Harbormaster completed remote builds in B195086: Diff 471745.Oct 29 2022, 5:24 AM

Closed by commit rG63a46385f2c6: Recommit [AArch64] Optimize memcmp when the result is tested for [in]equality… (authored by Allen).Oct 29 2022, 11:05 AM

This revision was automatically updated to reflect the committed changes.

Allen added a commit: rG63a46385f2c6: Recommit [AArch64] Optimize memcmp when the result is tested for [in]equality….

Allen mentioned this in D137721: [AArch64] Optimize more memcmp when the result is tested for [in]equality with 0.Nov 9 2022, 8:31 AM

Allen added a child revision: D137721: [AArch64] Optimize more memcmp when the result is tested for [in]equality with 0.Nov 12 2022, 7:54 PM

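Revision Contents

Path	Size
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp	29 lines
llvm/test/CodeGen/AArch64/atomicrmw-O0.ll	132 lines
llvm/test/CodeGen/AArch64/bcmp-inline-small.ll	12 lines
llvm/test/CodeGen/AArch64/bcmp.ll	36 lines
llvm/test/CodeGen/AArch64/dag-combine-setcc.ll	120 lines
llvm/test/CodeGen/AArch64/i128-cmp.ll	26 lines
llvm/test/CodeGen/AArch64/umulo-128-legalisation-lowering.ll	8 lines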

Diff 471760

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp

   if (DCI.isBeforeLegalize() && VT.isScalarInteger() &&
 [...]
     if (FromVT.isFixedLengthVector() &&
         FromVT.getVectorElementType() == MVT::i1) {
       LHS = DAG.getNode(ISD::VECREDUCE_OR, DL, MVT::i1, LHS->getOperand(0));
       LHS = DAG.getNode(ISD::ZERO_EXTEND, DL, ToVT, LHS);
       return DAG.getSetCC(DL, VT, LHS, RHS, Cond);
     }
   }
 
+  // Try to express conjunction "cmp 0 (or (xor A0 A1) (xor B0 B1))" as:
+  // cmp A0, A0; ccmp A0, B1, 0, eq; cmp inv(Cond) flag
+  if (!DCI.isBeforeLegalize() && VT.isScalarInteger() &&
+      (Cond == ISD::SETEQ || Cond == ISD::SETNE) && isNullConstant(RHS) &&
+      LHS->getOpcode() == ISD::OR &&
+      (LHS.getOperand(0)->getOpcode() == ISD::XOR &&
+       LHS.getOperand(1)->getOpcode() == ISD::XOR) &&
+      LHS.hasOneUse() && LHS.getOperand(0)->hasOneUse() &&
+      LHS.getOperand(1)->hasOneUse()) {
+    SDValue XOR0 = LHS.getOperand(0);
+    SDValue XOR1 = LHS.getOperand(1);
+    SDValue CCVal = DAG.getConstant(AArch64CC::EQ, DL, MVT_CC);
+    EVT TstVT = LHS->getValueType(0);
+    SDValue Cmp =
+        DAG.getNode(AArch64ISD::SUBS, DL, DAG.getVTList(TstVT, MVT::i32),
+                    XOR0.getOperand(0), XOR0.getOperand(1));
+    SDValue Overflow = Cmp.getValue(1);
+    SDValue NZCVOp = DAG.getConstant(0, DL, MVT::i32);
+    SDValue CCmp = DAG.getNode(AArch64ISD::CCMP, DL, MVT_CC, XOR1.getOperand(0),
+                               XOR1.getOperand(1), NZCVOp, CCVal, Overflow);
+    // Invert CSEL's operands.
+    SDValue TVal = DAG.getConstant(1, DL, VT);
+    SDValue FVal = DAG.getConstant(0, DL, VT);
+    AArch64CC::CondCode CC = changeIntCCToAArch64CC(Cond);
+    AArch64CC::CondCode InvCC = AArch64CC::getInvertedCondCode(CC);
+    return DAG.getNode(AArch64ISD::CSEL, DL, VT, FVal, TVal,
+                       DAG.getConstant(InvCC, DL, MVT::i32), CCmp);
+  }
+
   return SDValue();
 }
 
 // Replace a flag-setting operator (eg ANDS) with the generic version
 // (eg AND) if the flag is unused.
 static SDValue performFlagSettingCombine(SDNode *N,
                                          TargetLowering::DAGCombinerInfo &DCI,
                                          unsigned GenericOpcode) {
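
The "Invert CSEL's operands" step in the diff above can be read as follows: selecting (0, 1) on the inverted condition gives the same value as selecting (1, 0) on the original condition. A minimal sketch of that identity, assuming a toy model in which only the Z flag matters; `CondCode`, `inverted`, `holds`, `csel`, `combine_result` and `check_inversion` are hypothetical names and not the LLVM definitions.

```cpp
#include <cassert>
#include <initializer_list>

// Only the two condition codes the combine handles.
enum class CondCode { EQ, NE };

CondCode inverted(CondCode cc) {
  return cc == CondCode::EQ ? CondCode::NE : CondCode::EQ;
}

// EQ holds when Z is set, NE when Z is clear.
bool holds(CondCode cc, bool z) {
  return cc == CondCode::EQ ? z : !z;
}

// CSEL a, b, cc: picks a when cc holds on the current flags, else b.
int csel(int a, int b, CondCode cc, bool z) {
  return holds(cc, z) ? a : b;
}

// Shape built by the combine: CSEL FVal(0), TVal(1), inv(CC).
int combine_result(CondCode cc, bool z) {
  return csel(0, 1, inverted(cc), z);
}

// The result is 1 exactly when the original condition holds.
void check_inversion() {
  for (bool z : {false, true})
    for (CondCode cc : {CondCode::EQ, CondCode::NE})
      assert(combine_result(cc, z) == (holds(cc, z) ? 1 : 0));
}
```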

llvm/test/CodeGen/AArch64/atomicrmw-O0.ll

	; NOLSE-NEXT: ldr x9, [x0]			; NOLSE-NEXT: ldr x9, [x0]
	; NOLSE-NEXT: str x9, [sp, #32] // 8-byte Folded Spill			; NOLSE-NEXT: str x9, [sp, #32] // 8-byte Folded Spill
	; NOLSE-NEXT: str x8, [sp, #40] // 8-byte Folded Spill			; NOLSE-NEXT: str x8, [sp, #40] // 8-byte Folded Spill
	; NOLSE-NEXT: b .LBB4_1			; NOLSE-NEXT: b .LBB4_1
	; NOLSE-NEXT: .LBB4_1: // %atomicrmw.start			; NOLSE-NEXT: .LBB4_1: // %atomicrmw.start
	; NOLSE-NEXT: // =>This Loop Header: Depth=1			; NOLSE-NEXT: // =>This Loop Header: Depth=1
	; NOLSE-NEXT: // Child Loop BB4_2 Depth 2			; NOLSE-NEXT: // Child Loop BB4_2 Depth 2
	; NOLSE-NEXT: ldr x11, [sp, #40] // 8-byte Folded Reload			; NOLSE-NEXT: ldr x11, [sp, #40] // 8-byte Folded Reload
	; NOLSE-NEXT: ldr x8, [sp, #32] // 8-byte Folded Reload			; NOLSE-NEXT: ldr x13, [sp, #32] // 8-byte Folded Reload
	; NOLSE-NEXT: ldr x13, [sp, #24] // 8-byte Folded Reload			; NOLSE-NEXT: ldr x10, [sp, #24] // 8-byte Folded Reload
	; NOLSE-NEXT: adds x14, x8, #1			; NOLSE-NEXT: adds x14, x13, #1
	; NOLSE-NEXT: cinc x15, x11, hs			; NOLSE-NEXT: cinc x15, x11, hs
	; NOLSE-NEXT: .LBB4_2: // %atomicrmw.start			; NOLSE-NEXT: .LBB4_2: // %atomicrmw.start
	; NOLSE-NEXT: // Parent Loop BB4_1 Depth=1			; NOLSE-NEXT: // Parent Loop BB4_1 Depth=1
	; NOLSE-NEXT: // => This Inner Loop Header: Depth=2			; NOLSE-NEXT: // => This Inner Loop Header: Depth=2
	; NOLSE-NEXT: ldaxp x10, x9, [x13]			; NOLSE-NEXT: ldaxp x12, x8, [x10]
	; NOLSE-NEXT: cmp x10, x8			; NOLSE-NEXT: cmp x12, x13
	; NOLSE-NEXT: cset w12, ne			; NOLSE-NEXT: cset w9, ne
	; NOLSE-NEXT: cmp x9, x11			; NOLSE-NEXT: cmp x8, x11
	; NOLSE-NEXT: cinc w12, w12, ne			; NOLSE-NEXT: cinc w9, w9, ne
	; NOLSE-NEXT: cbnz w12, .LBB4_4			; NOLSE-NEXT: cbnz w9, .LBB4_4
	; NOLSE-NEXT: // %bb.3: // %atomicrmw.start			; NOLSE-NEXT: // %bb.3: // %atomicrmw.start
	; NOLSE-NEXT: // in Loop: Header=BB4_2 Depth=2			; NOLSE-NEXT: // in Loop: Header=BB4_2 Depth=2
	; NOLSE-NEXT: stlxp w12, x14, x15, [x13]			; NOLSE-NEXT: stlxp w9, x14, x15, [x10]
	; NOLSE-NEXT: cbnz w12, .LBB4_2			; NOLSE-NEXT: cbnz w9, .LBB4_2
	; NOLSE-NEXT: b .LBB4_5			; NOLSE-NEXT: b .LBB4_5
	; NOLSE-NEXT: .LBB4_4: // %atomicrmw.start			; NOLSE-NEXT: .LBB4_4: // %atomicrmw.start
	; NOLSE-NEXT: // in Loop: Header=BB4_2 Depth=2			; NOLSE-NEXT: // in Loop: Header=BB4_2 Depth=2
	; NOLSE-NEXT: stlxp w12, x10, x9, [x13]			; NOLSE-NEXT: stlxp w9, x12, x8, [x10]
	; NOLSE-NEXT: cbnz w12, .LBB4_2			; NOLSE-NEXT: cbnz w9, .LBB4_2
	; NOLSE-NEXT: .LBB4_5: // %atomicrmw.start			; NOLSE-NEXT: .LBB4_5: // %atomicrmw.start
	; NOLSE-NEXT: // in Loop: Header=BB4_1 Depth=1			; NOLSE-NEXT: // in Loop: Header=BB4_1 Depth=1
	; NOLSE-NEXT: eor x11, x9, x11			; NOLSE-NEXT: mov x9, x8
	; NOLSE-NEXT: eor x8, x10, x8
	; NOLSE-NEXT: orr x8, x8, x11
	; NOLSE-NEXT: str x9, [sp, #8] // 8-byte Folded Spill			; NOLSE-NEXT: str x9, [sp, #8] // 8-byte Folded Spill
				; NOLSE-NEXT: mov x10, x12
	; NOLSE-NEXT: str x10, [sp, #16] // 8-byte Folded Spill			; NOLSE-NEXT: str x10, [sp, #16] // 8-byte Folded Spill
				; NOLSE-NEXT: subs x12, x12, x13
				; NOLSE-NEXT: ccmp x8, x11, #0, eq
				; NOLSE-NEXT: cset w8, ne
	; NOLSE-NEXT: str x10, [sp, #32] // 8-byte Folded Spill			; NOLSE-NEXT: str x10, [sp, #32] // 8-byte Folded Spill
	; NOLSE-NEXT: str x9, [sp, #40] // 8-byte Folded Spill			; NOLSE-NEXT: str x9, [sp, #40] // 8-byte Folded Spill
	; NOLSE-NEXT: cbnz x8, .LBB4_1			; NOLSE-NEXT: tbnz w8, #0, .LBB4_1
	; NOLSE-NEXT: b .LBB4_6			; NOLSE-NEXT: b .LBB4_6
	; NOLSE-NEXT: .LBB4_6: // %atomicrmw.end			; NOLSE-NEXT: .LBB4_6: // %atomicrmw.end
	; NOLSE-NEXT: ldr x1, [sp, #8] // 8-byte Folded Reload			; NOLSE-NEXT: ldr x1, [sp, #8] // 8-byte Folded Reload
	; NOLSE-NEXT: ldr x0, [sp, #16] // 8-byte Folded Reload			; NOLSE-NEXT: ldr x0, [sp, #16] // 8-byte Folded Reload
	; NOLSE-NEXT: add sp, sp, #48			; NOLSE-NEXT: add sp, sp, #48
	; NOLSE-NEXT: ret			; NOLSE-NEXT: ret
	;			;
	; LSE-LABEL: test_rmw_add_128:			; LSE-LABEL: test_rmw_add_128:
	; LSE: // %bb.0: // %entry			; LSE: // %bb.0: // %entry
	; LSE-NEXT: sub sp, sp, #48			; LSE-NEXT: sub sp, sp, #48
	; LSE-NEXT: .cfi_def_cfa_offset 48			; LSE-NEXT: .cfi_def_cfa_offset 48
	; LSE-NEXT: str x0, [sp, #24] // 8-byte Folded Spill			; LSE-NEXT: str x0, [sp, #24] // 8-byte Folded Spill
	; LSE-NEXT: ldr x8, [x0, #8]			; LSE-NEXT: ldr x8, [x0, #8]
	; LSE-NEXT: ldr x9, [x0]			; LSE-NEXT: ldr x9, [x0]
	; LSE-NEXT: str x9, [sp, #32] // 8-byte Folded Spill			; LSE-NEXT: str x9, [sp, #32] // 8-byte Folded Spill
	; LSE-NEXT: str x8, [sp, #40] // 8-byte Folded Spill			; LSE-NEXT: str x8, [sp, #40] // 8-byte Folded Spill
	; LSE-NEXT: b .LBB4_1			; LSE-NEXT: b .LBB4_1
	; LSE-NEXT: .LBB4_1: // %atomicrmw.start			; LSE-NEXT: .LBB4_1: // %atomicrmw.start
	; LSE-NEXT: // =>This Inner Loop Header: Depth=1			; LSE-NEXT: // =>This Inner Loop Header: Depth=1
	; LSE-NEXT: ldr x10, [sp, #40] // 8-byte Folded Reload			; LSE-NEXT: ldr x8, [sp, #40] // 8-byte Folded Reload
	; LSE-NEXT: ldr x8, [sp, #32] // 8-byte Folded Reload			; LSE-NEXT: ldr x11, [sp, #32] // 8-byte Folded Reload
	; LSE-NEXT: ldr x9, [sp, #24] // 8-byte Folded Reload			; LSE-NEXT: ldr x9, [sp, #24] // 8-byte Folded Reload
	; LSE-NEXT: mov x0, x8			; LSE-NEXT: mov x0, x11
	; LSE-NEXT: mov x1, x10			; LSE-NEXT: mov x1, x8
	; LSE-NEXT: adds x2, x8, #1			; LSE-NEXT: adds x2, x11, #1
	; LSE-NEXT: cinc x11, x10, hs			; LSE-NEXT: cinc x10, x8, hs
	; LSE-NEXT: // kill: def $x2 killed $x2 def $x2_x3			; LSE-NEXT: // kill: def $x2 killed $x2 def $x2_x3
	; LSE-NEXT: mov x3, x11			; LSE-NEXT: mov x3, x10
	; LSE-NEXT: caspal x0, x1, x2, x3, [x9]			; LSE-NEXT: caspal x0, x1, x2, x3, [x9]
	; LSE-NEXT: mov x9, x1			; LSE-NEXT: mov x9, x1
	; LSE-NEXT: str x9, [sp, #8] // 8-byte Folded Spill			; LSE-NEXT: str x9, [sp, #8] // 8-byte Folded Spill
	; LSE-NEXT: eor x11, x9, x10
	; LSE-NEXT: mov x10, x0			; LSE-NEXT: mov x10, x0
	; LSE-NEXT: str x10, [sp, #16] // 8-byte Folded Spill			; LSE-NEXT: str x10, [sp, #16] // 8-byte Folded Spill
	; LSE-NEXT: eor x8, x10, x8			; LSE-NEXT: subs x11, x10, x11
	; LSE-NEXT: orr x8, x8, x11			; LSE-NEXT: ccmp x9, x8, #0, eq
				; LSE-NEXT: cset w8, ne
	; LSE-NEXT: str x10, [sp, #32] // 8-byte Folded Spill			; LSE-NEXT: str x10, [sp, #32] // 8-byte Folded Spill
	; LSE-NEXT: str x9, [sp, #40] // 8-byte Folded Spill			; LSE-NEXT: str x9, [sp, #40] // 8-byte Folded Spill
	; LSE-NEXT: cbnz x8, .LBB4_1			; LSE-NEXT: tbnz w8, #0, .LBB4_1
	; LSE-NEXT: b .LBB4_2			; LSE-NEXT: b .LBB4_2
	; LSE-NEXT: .LBB4_2: // %atomicrmw.end			; LSE-NEXT: .LBB4_2: // %atomicrmw.end
	; LSE-NEXT: ldr x1, [sp, #8] // 8-byte Folded Reload			; LSE-NEXT: ldr x1, [sp, #8] // 8-byte Folded Reload
	; LSE-NEXT: ldr x0, [sp, #16] // 8-byte Folded Reload			; LSE-NEXT: ldr x0, [sp, #16] // 8-byte Folded Reload
	; LSE-NEXT: add sp, sp, #48			; LSE-NEXT: add sp, sp, #48
	; LSE-NEXT: ret			; LSE-NEXT: ret
	entry:			entry:
	%res = atomicrmw add i128* %dst, i128 1 seq_cst			%res = atomicrmw add i128* %dst, i128 1 seq_cst
	▲ Show 20 Lines • Show All 303 Lines • ▼ Show 20 Lines
	; NOLSE-NEXT: ldr x9, [x0]			; NOLSE-NEXT: ldr x9, [x0]
	; NOLSE-NEXT: str x9, [sp, #32] // 8-byte Folded Spill			; NOLSE-NEXT: str x9, [sp, #32] // 8-byte Folded Spill
	; NOLSE-NEXT: str x8, [sp, #40] // 8-byte Folded Spill			; NOLSE-NEXT: str x8, [sp, #40] // 8-byte Folded Spill
	; NOLSE-NEXT: b .LBB9_1			; NOLSE-NEXT: b .LBB9_1
	; NOLSE-NEXT: .LBB9_1: // %atomicrmw.start			; NOLSE-NEXT: .LBB9_1: // %atomicrmw.start
	; NOLSE-NEXT: // =>This Loop Header: Depth=1			; NOLSE-NEXT: // =>This Loop Header: Depth=1
	; NOLSE-NEXT: // Child Loop BB9_2 Depth 2			; NOLSE-NEXT: // Child Loop BB9_2 Depth 2
	; NOLSE-NEXT: ldr x11, [sp, #40] // 8-byte Folded Reload			; NOLSE-NEXT: ldr x11, [sp, #40] // 8-byte Folded Reload
	; NOLSE-NEXT: ldr x8, [sp, #32] // 8-byte Folded Reload			; NOLSE-NEXT: ldr x13, [sp, #32] // 8-byte Folded Reload
	; NOLSE-NEXT: ldr x13, [sp, #24] // 8-byte Folded Reload			; NOLSE-NEXT: ldr x10, [sp, #24] // 8-byte Folded Reload
	; NOLSE-NEXT: mov w9, w8			; NOLSE-NEXT: mov w8, w13
	; NOLSE-NEXT: mvn w10, w9			; NOLSE-NEXT: mvn w9, w8
	; NOLSE-NEXT: // implicit-def: $x9			; NOLSE-NEXT: // implicit-def: $x8
	; NOLSE-NEXT: mov w9, w10			; NOLSE-NEXT: mov w8, w9
	; NOLSE-NEXT: orr x14, x9, #0xfffffffffffffffe			; NOLSE-NEXT: orr x14, x8, #0xfffffffffffffffe
	; NOLSE-NEXT: mov x15, #-1			; NOLSE-NEXT: mov x15, #-1
	; NOLSE-NEXT: .LBB9_2: // %atomicrmw.start			; NOLSE-NEXT: .LBB9_2: // %atomicrmw.start
	; NOLSE-NEXT: // Parent Loop BB9_1 Depth=1			; NOLSE-NEXT: // Parent Loop BB9_1 Depth=1
	; NOLSE-NEXT: // => This Inner Loop Header: Depth=2			; NOLSE-NEXT: // => This Inner Loop Header: Depth=2
	; NOLSE-NEXT: ldaxp x10, x9, [x13]			; NOLSE-NEXT: ldaxp x12, x8, [x10]
	; NOLSE-NEXT: cmp x10, x8			; NOLSE-NEXT: cmp x12, x13
	; NOLSE-NEXT: cset w12, ne			; NOLSE-NEXT: cset w9, ne
	; NOLSE-NEXT: cmp x9, x11			; NOLSE-NEXT: cmp x8, x11
	; NOLSE-NEXT: cinc w12, w12, ne			; NOLSE-NEXT: cinc w9, w9, ne
	; NOLSE-NEXT: cbnz w12, .LBB9_4			; NOLSE-NEXT: cbnz w9, .LBB9_4
	; NOLSE-NEXT: // %bb.3: // %atomicrmw.start			; NOLSE-NEXT: // %bb.3: // %atomicrmw.start
	; NOLSE-NEXT: // in Loop: Header=BB9_2 Depth=2			; NOLSE-NEXT: // in Loop: Header=BB9_2 Depth=2
	; NOLSE-NEXT: stlxp w12, x14, x15, [x13]			; NOLSE-NEXT: stlxp w9, x14, x15, [x10]
	; NOLSE-NEXT: cbnz w12, .LBB9_2			; NOLSE-NEXT: cbnz w9, .LBB9_2
	; NOLSE-NEXT: b .LBB9_5			; NOLSE-NEXT: b .LBB9_5
	; NOLSE-NEXT: .LBB9_4: // %atomicrmw.start			; NOLSE-NEXT: .LBB9_4: // %atomicrmw.start
	; NOLSE-NEXT: // in Loop: Header=BB9_2 Depth=2			; NOLSE-NEXT: // in Loop: Header=BB9_2 Depth=2
	; NOLSE-NEXT: stlxp w12, x10, x9, [x13]			; NOLSE-NEXT: stlxp w9, x12, x8, [x10]
	; NOLSE-NEXT: cbnz w12, .LBB9_2			; NOLSE-NEXT: cbnz w9, .LBB9_2
	; NOLSE-NEXT: .LBB9_5: // %atomicrmw.start			; NOLSE-NEXT: .LBB9_5: // %atomicrmw.start
	; NOLSE-NEXT: // in Loop: Header=BB9_1 Depth=1			; NOLSE-NEXT: // in Loop: Header=BB9_1 Depth=1
	; NOLSE-NEXT: eor x11, x9, x11			; NOLSE-NEXT: mov x9, x8
	; NOLSE-NEXT: eor x8, x10, x8
	; NOLSE-NEXT: orr x8, x8, x11
	; NOLSE-NEXT: str x9, [sp, #8] // 8-byte Folded Spill			; NOLSE-NEXT: str x9, [sp, #8] // 8-byte Folded Spill
				; NOLSE-NEXT: mov x10, x12
	; NOLSE-NEXT: str x10, [sp, #16] // 8-byte Folded Spill			; NOLSE-NEXT: str x10, [sp, #16] // 8-byte Folded Spill
				; NOLSE-NEXT: subs x12, x12, x13
				; NOLSE-NEXT: ccmp x8, x11, #0, eq
				; NOLSE-NEXT: cset w8, ne
	; NOLSE-NEXT: str x10, [sp, #32] // 8-byte Folded Spill			; NOLSE-NEXT: str x10, [sp, #32] // 8-byte Folded Spill
	; NOLSE-NEXT: str x9, [sp, #40] // 8-byte Folded Spill			; NOLSE-NEXT: str x9, [sp, #40] // 8-byte Folded Spill
	; NOLSE-NEXT: cbnz x8, .LBB9_1			; NOLSE-NEXT: tbnz w8, #0, .LBB9_1
	; NOLSE-NEXT: b .LBB9_6			; NOLSE-NEXT: b .LBB9_6
	; NOLSE-NEXT: .LBB9_6: // %atomicrmw.end			; NOLSE-NEXT: .LBB9_6: // %atomicrmw.end
	; NOLSE-NEXT: ldr x1, [sp, #8] // 8-byte Folded Reload			; NOLSE-NEXT: ldr x1, [sp, #8] // 8-byte Folded Reload
	; NOLSE-NEXT: ldr x0, [sp, #16] // 8-byte Folded Reload			; NOLSE-NEXT: ldr x0, [sp, #16] // 8-byte Folded Reload
	; NOLSE-NEXT: add sp, sp, #48			; NOLSE-NEXT: add sp, sp, #48
	; NOLSE-NEXT: ret			; NOLSE-NEXT: ret
	;			;
	; LSE-LABEL: test_rmw_nand_128:			; LSE-LABEL: test_rmw_nand_128:
	; LSE: // %bb.0: // %entry			; LSE: // %bb.0: // %entry
	; LSE-NEXT: sub sp, sp, #48			; LSE-NEXT: sub sp, sp, #48
	; LSE-NEXT: .cfi_def_cfa_offset 48			; LSE-NEXT: .cfi_def_cfa_offset 48
	; LSE-NEXT: str x0, [sp, #24] // 8-byte Folded Spill			; LSE-NEXT: str x0, [sp, #24] // 8-byte Folded Spill
	; LSE-NEXT: ldr x8, [x0, #8]			; LSE-NEXT: ldr x8, [x0, #8]
	; LSE-NEXT: ldr x9, [x0]			; LSE-NEXT: ldr x9, [x0]
	; LSE-NEXT: str x9, [sp, #32] // 8-byte Folded Spill			; LSE-NEXT: str x9, [sp, #32] // 8-byte Folded Spill
	; LSE-NEXT: str x8, [sp, #40] // 8-byte Folded Spill			; LSE-NEXT: str x8, [sp, #40] // 8-byte Folded Spill
	; LSE-NEXT: b .LBB9_1			; LSE-NEXT: b .LBB9_1
	; LSE-NEXT: .LBB9_1: // %atomicrmw.start			; LSE-NEXT: .LBB9_1: // %atomicrmw.start
	; LSE-NEXT: // =>This Inner Loop Header: Depth=1			; LSE-NEXT: // =>This Inner Loop Header: Depth=1
	; LSE-NEXT: ldr x10, [sp, #40] // 8-byte Folded Reload			; LSE-NEXT: ldr x8, [sp, #40] // 8-byte Folded Reload
	; LSE-NEXT: ldr x8, [sp, #32] // 8-byte Folded Reload			; LSE-NEXT: ldr x11, [sp, #32] // 8-byte Folded Reload
	; LSE-NEXT: ldr x9, [sp, #24] // 8-byte Folded Reload			; LSE-NEXT: ldr x9, [sp, #24] // 8-byte Folded Reload
	; LSE-NEXT: mov x0, x8			; LSE-NEXT: mov x0, x11
	; LSE-NEXT: mov x1, x10			; LSE-NEXT: mov x1, x8
	; LSE-NEXT: mov w11, w8			; LSE-NEXT: mov w10, w11
	; LSE-NEXT: mvn w12, w11			; LSE-NEXT: mvn w12, w10
	; LSE-NEXT: // implicit-def: $x11			; LSE-NEXT: // implicit-def: $x10
	; LSE-NEXT: mov w11, w12			; LSE-NEXT: mov w10, w12
	; LSE-NEXT: orr x2, x11, #0xfffffffffffffffe			; LSE-NEXT: orr x2, x10, #0xfffffffffffffffe
	; LSE-NEXT: mov x11, #-1			; LSE-NEXT: mov x10, #-1
	; LSE-NEXT: // kill: def $x2 killed $x2 def $x2_x3			; LSE-NEXT: // kill: def $x2 killed $x2 def $x2_x3
	; LSE-NEXT: mov x3, x11			; LSE-NEXT: mov x3, x10
	; LSE-NEXT: caspal x0, x1, x2, x3, [x9]			; LSE-NEXT: caspal x0, x1, x2, x3, [x9]
	; LSE-NEXT: mov x9, x1			; LSE-NEXT: mov x9, x1
	; LSE-NEXT: str x9, [sp, #8] // 8-byte Folded Spill			; LSE-NEXT: str x9, [sp, #8] // 8-byte Folded Spill
	; LSE-NEXT: eor x11, x9, x10
	; LSE-NEXT: mov x10, x0			; LSE-NEXT: mov x10, x0
	; LSE-NEXT: str x10, [sp, #16] // 8-byte Folded Spill			; LSE-NEXT: str x10, [sp, #16] // 8-byte Folded Spill
	; LSE-NEXT: eor x8, x10, x8			; LSE-NEXT: subs x11, x10, x11
	; LSE-NEXT: orr x8, x8, x11			; LSE-NEXT: ccmp x9, x8, #0, eq
				; LSE-NEXT: cset w8, ne
	; LSE-NEXT: str x10, [sp, #32] // 8-byte Folded Spill			; LSE-NEXT: str x10, [sp, #32] // 8-byte Folded Spill
	; LSE-NEXT: str x9, [sp, #40] // 8-byte Folded Spill			; LSE-NEXT: str x9, [sp, #40] // 8-byte Folded Spill
	; LSE-NEXT: cbnz x8, .LBB9_1			; LSE-NEXT: tbnz w8, #0, .LBB9_1
	; LSE-NEXT: b .LBB9_2			; LSE-NEXT: b .LBB9_2
	; LSE-NEXT: .LBB9_2: // %atomicrmw.end			; LSE-NEXT: .LBB9_2: // %atomicrmw.end
	; LSE-NEXT: ldr x1, [sp, #8] // 8-byte Folded Reload			; LSE-NEXT: ldr x1, [sp, #8] // 8-byte Folded Reload
	; LSE-NEXT: ldr x0, [sp, #16] // 8-byte Folded Reload			; LSE-NEXT: ldr x0, [sp, #16] // 8-byte Folded Reload
	; LSE-NEXT: add sp, sp, #48			; LSE-NEXT: add sp, sp, #48
	; LSE-NEXT: ret			; LSE-NEXT: ret
	entry:			entry:
	%res = atomicrmw nand i128* %dst, i128 1 seq_cst			%res = atomicrmw nand i128* %dst, i128 1 seq_cst
	ret i128 %res			ret i128 %res
	}			}

llvm/test/CodeGen/AArch64/bcmp-inline-small.ll

	; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
	; RUN: llc -O2 < %s -mtriple=aarch64-linux-gnu | FileCheck %s --check-prefix=CHECKN			; RUN: llc -O2 < %s -mtriple=aarch64-linux-gnu | FileCheck %s --check-prefix=CHECKN
	; RUN: llc -O2 < %s -mtriple=aarch64-linux-gnu -mattr=strict-align | FileCheck %s --check-prefix=CHECKS			; RUN: llc -O2 < %s -mtriple=aarch64-linux-gnu -mattr=strict-align | FileCheck %s --check-prefix=CHECKS

	declare i32 @bcmp(i8*, i8*, i64) nounwind readonly			declare i32 @bcmp(i8*, i8*, i64) nounwind readonly
	declare i32 @memcmp(i8*, i8*, i64) nounwind readonly			declare i32 @memcmp(i8*, i8*, i64) nounwind readonly

	define i1 @test_b2(i8* %s1, i8* %s2) {			define i1 @test_b2(i8* %s1, i8* %s2) {
	; CHECKN-LABEL: test_b2:			; CHECKN-LABEL: test_b2:
	; CHECKN: // %bb.0: // %entry			; CHECKN: // %bb.0: // %entry
	; CHECKN-NEXT: ldr x8, [x0]			; CHECKN-NEXT: ldr x8, [x0]
	; CHECKN-NEXT: ldr x9, [x1]			; CHECKN-NEXT: ldr x9, [x1]
	; CHECKN-NEXT: ldur x10, [x0, #7]			; CHECKN-NEXT: ldur x10, [x0, #7]
	; CHECKN-NEXT: ldur x11, [x1, #7]			; CHECKN-NEXT: ldur x11, [x1, #7]
	; CHECKN-NEXT: eor x8, x8, x9			; CHECKN-NEXT: cmp x8, x9
	; CHECKN-NEXT: eor x9, x10, x11			; CHECKN-NEXT: ccmp x10, x11, #0, eq
	; CHECKN-NEXT: orr x8, x8, x9
	; CHECKN-NEXT: cmp x8, #0
	; CHECKN-NEXT: cset w0, eq			; CHECKN-NEXT: cset w0, eq
	; CHECKN-NEXT: ret			; CHECKN-NEXT: ret
	;			;
	; CHECKS-LABEL: test_b2:			; CHECKS-LABEL: test_b2:
	; CHECKS: // %bb.0: // %entry			; CHECKS: // %bb.0: // %entry
	; CHECKS-NEXT: str x30, [sp, #-16]! // 8-byte Folded Spill			; CHECKS-NEXT: str x30, [sp, #-16]! // 8-byte Folded Spill
	; CHECKS-NEXT: .cfi_def_cfa_offset 16			; CHECKS-NEXT: .cfi_def_cfa_offset 16
	; CHECKS-NEXT: .cfi_offset w30, -16			; CHECKS-NEXT: .cfi_offset w30, -16
	Show All 12 Lines
	; TODO: Four loads should be within the limit, but the heuristic isn't implemented.			; TODO: Four loads should be within the limit, but the heuristic isn't implemented.
	define i1 @test_b2_align8(i8* align 8 %s1, i8* align 8 %s2) {			define i1 @test_b2_align8(i8* align 8 %s1, i8* align 8 %s2) {
	; CHECKN-LABEL: test_b2_align8:			; CHECKN-LABEL: test_b2_align8:
	; CHECKN: // %bb.0: // %entry			; CHECKN: // %bb.0: // %entry
	; CHECKN-NEXT: ldr x8, [x0]			; CHECKN-NEXT: ldr x8, [x0]
	; CHECKN-NEXT: ldr x9, [x1]			; CHECKN-NEXT: ldr x9, [x1]
	; CHECKN-NEXT: ldur x10, [x0, #7]			; CHECKN-NEXT: ldur x10, [x0, #7]
	; CHECKN-NEXT: ldur x11, [x1, #7]			; CHECKN-NEXT: ldur x11, [x1, #7]
	; CHECKN-NEXT: eor x8, x8, x9			; CHECKN-NEXT: cmp x8, x9
	; CHECKN-NEXT: eor x9, x10, x11			; CHECKN-NEXT: ccmp x10, x11, #0, eq
	; CHECKN-NEXT: orr x8, x8, x9
	; CHECKN-NEXT: cmp x8, #0
	; CHECKN-NEXT: cset w0, eq			; CHECKN-NEXT: cset w0, eq
	; CHECKN-NEXT: ret			; CHECKN-NEXT: ret
	;			;
	; CHECKS-LABEL: test_b2_align8:			; CHECKS-LABEL: test_b2_align8:
	; CHECKS: // %bb.0: // %entry			; CHECKS: // %bb.0: // %entry
	; CHECKS-NEXT: str x30, [sp, #-16]! // 8-byte Folded Spill			; CHECKS-NEXT: str x30, [sp, #-16]! // 8-byte Folded Spill
	; CHECKS-NEXT: .cfi_def_cfa_offset 16			; CHECKS-NEXT: .cfi_def_cfa_offset 16
	; CHECKS-NEXT: .cfi_offset w30, -16			; CHECKS-NEXT: .cfi_offset w30, -16
	▲ Show 20 Lines • Show All 48 Lines • Show Last 20 Lines

llvm/test/CodeGen/AArch64/bcmp.ll

	Show First 20 Lines • Show All 107 Lines • ▼ Show 20 Lines

	define i1 @bcmp7(ptr %a, ptr %b) {			define i1 @bcmp7(ptr %a, ptr %b) {
	; CHECK-LABEL: bcmp7:			; CHECK-LABEL: bcmp7:
	; CHECK: // %bb.0:			; CHECK: // %bb.0:
	; CHECK-NEXT: ldr w8, [x0]			; CHECK-NEXT: ldr w8, [x0]
	; CHECK-NEXT: ldr w9, [x1]			; CHECK-NEXT: ldr w9, [x1]
	; CHECK-NEXT: ldur w10, [x0, #3]			; CHECK-NEXT: ldur w10, [x0, #3]
	; CHECK-NEXT: ldur w11, [x1, #3]			; CHECK-NEXT: ldur w11, [x1, #3]
	; CHECK-NEXT: eor w8, w8, w9			; CHECK-NEXT: cmp w8, w9
	; CHECK-NEXT: eor w9, w10, w11			; CHECK-NEXT: ccmp w10, w11, #0, eq
	; CHECK-NEXT: orr w8, w8, w9
	; CHECK-NEXT: cmp w8, #0
	; CHECK-NEXT: cset w0, eq			; CHECK-NEXT: cset w0, eq
	; CHECK-NEXT: ret			; CHECK-NEXT: ret
	%cr = call i32 @bcmp(ptr %a, ptr %b, i64 7)			%cr = call i32 @bcmp(ptr %a, ptr %b, i64 7)
	%r = icmp eq i32 %cr, 0			%r = icmp eq i32 %cr, 0
	ret i1 %r			ret i1 %r
	}			}

	define i1 @bcmp8(ptr %a, ptr %b) {			define i1 @bcmp8(ptr %a, ptr %b) {
	▲ Show 20 Lines • Show All 49 Lines • ▼ Show 20 Lines

	define i1 @bcmp11(ptr %a, ptr %b) {			define i1 @bcmp11(ptr %a, ptr %b) {
	; CHECK-LABEL: bcmp11:			; CHECK-LABEL: bcmp11:
	; CHECK: // %bb.0:			; CHECK: // %bb.0:
	; CHECK-NEXT: ldr x8, [x0]			; CHECK-NEXT: ldr x8, [x0]
	; CHECK-NEXT: ldr x9, [x1]			; CHECK-NEXT: ldr x9, [x1]
	; CHECK-NEXT: ldur x10, [x0, #3]			; CHECK-NEXT: ldur x10, [x0, #3]
	; CHECK-NEXT: ldur x11, [x1, #3]			; CHECK-NEXT: ldur x11, [x1, #3]
	; CHECK-NEXT: eor x8, x8, x9			; CHECK-NEXT: cmp x8, x9
	; CHECK-NEXT: eor x9, x10, x11			; CHECK-NEXT: ccmp x10, x11, #0, eq
	; CHECK-NEXT: orr x8, x8, x9
	; CHECK-NEXT: cmp x8, #0
	; CHECK-NEXT: cset w0, eq			; CHECK-NEXT: cset w0, eq
	; CHECK-NEXT: ret			; CHECK-NEXT: ret
	%cr = call i32 @bcmp(ptr %a, ptr %b, i64 11)			%cr = call i32 @bcmp(ptr %a, ptr %b, i64 11)
	%r = icmp eq i32 %cr, 0			%r = icmp eq i32 %cr, 0
	ret i1 %r			ret i1 %r
	}			}

	define i1 @bcmp12(ptr %a, ptr %b) {			define i1 @bcmp12(ptr %a, ptr %b) {
	Show All 16 Lines

	define i1 @bcmp13(ptr %a, ptr %b) {			define i1 @bcmp13(ptr %a, ptr %b) {
	; CHECK-LABEL: bcmp13:			; CHECK-LABEL: bcmp13:
	; CHECK: // %bb.0:			; CHECK: // %bb.0:
	; CHECK-NEXT: ldr x8, [x0]			; CHECK-NEXT: ldr x8, [x0]
	; CHECK-NEXT: ldr x9, [x1]			; CHECK-NEXT: ldr x9, [x1]
	; CHECK-NEXT: ldur x10, [x0, #5]			; CHECK-NEXT: ldur x10, [x0, #5]
	; CHECK-NEXT: ldur x11, [x1, #5]			; CHECK-NEXT: ldur x11, [x1, #5]
	; CHECK-NEXT: eor x8, x8, x9			; CHECK-NEXT: cmp x8, x9
	; CHECK-NEXT: eor x9, x10, x11			; CHECK-NEXT: ccmp x10, x11, #0, eq
	; CHECK-NEXT: orr x8, x8, x9
	; CHECK-NEXT: cmp x8, #0
	; CHECK-NEXT: cset w0, eq			; CHECK-NEXT: cset w0, eq
	; CHECK-NEXT: ret			; CHECK-NEXT: ret
	%cr = call i32 @bcmp(ptr %a, ptr %b, i64 13)			%cr = call i32 @bcmp(ptr %a, ptr %b, i64 13)
	%r = icmp eq i32 %cr, 0			%r = icmp eq i32 %cr, 0
	ret i1 %r			ret i1 %r
	}			}

	define i1 @bcmp14(ptr %a, ptr %b) {			define i1 @bcmp14(ptr %a, ptr %b) {
	; CHECK-LABEL: bcmp14:			; CHECK-LABEL: bcmp14:
	; CHECK: // %bb.0:			; CHECK: // %bb.0:
	; CHECK-NEXT: ldr x8, [x0]			; CHECK-NEXT: ldr x8, [x0]
	; CHECK-NEXT: ldr x9, [x1]			; CHECK-NEXT: ldr x9, [x1]
	; CHECK-NEXT: ldur x10, [x0, #6]			; CHECK-NEXT: ldur x10, [x0, #6]
	; CHECK-NEXT: ldur x11, [x1, #6]			; CHECK-NEXT: ldur x11, [x1, #6]
	; CHECK-NEXT: eor x8, x8, x9			; CHECK-NEXT: cmp x8, x9
	; CHECK-NEXT: eor x9, x10, x11			; CHECK-NEXT: ccmp x10, x11, #0, eq
	; CHECK-NEXT: orr x8, x8, x9
	; CHECK-NEXT: cmp x8, #0
	; CHECK-NEXT: cset w0, eq			; CHECK-NEXT: cset w0, eq
	; CHECK-NEXT: ret			; CHECK-NEXT: ret
	%cr = call i32 @bcmp(ptr %a, ptr %b, i64 14)			%cr = call i32 @bcmp(ptr %a, ptr %b, i64 14)
	%r = icmp eq i32 %cr, 0			%r = icmp eq i32 %cr, 0
	ret i1 %r			ret i1 %r
	}			}

	define i1 @bcmp15(ptr %a, ptr %b) {			define i1 @bcmp15(ptr %a, ptr %b) {
	; CHECK-LABEL: bcmp15:			; CHECK-LABEL: bcmp15:
	; CHECK: // %bb.0:			; CHECK: // %bb.0:
	; CHECK-NEXT: ldr x8, [x0]			; CHECK-NEXT: ldr x8, [x0]
	; CHECK-NEXT: ldr x9, [x1]			; CHECK-NEXT: ldr x9, [x1]
	; CHECK-NEXT: ldur x10, [x0, #7]			; CHECK-NEXT: ldur x10, [x0, #7]
	; CHECK-NEXT: ldur x11, [x1, #7]			; CHECK-NEXT: ldur x11, [x1, #7]
	; CHECK-NEXT: eor x8, x8, x9			; CHECK-NEXT: cmp x8, x9
	; CHECK-NEXT: eor x9, x10, x11			; CHECK-NEXT: ccmp x10, x11, #0, eq
	; CHECK-NEXT: orr x8, x8, x9
	; CHECK-NEXT: cmp x8, #0
	; CHECK-NEXT: cset w0, eq			; CHECK-NEXT: cset w0, eq
	; CHECK-NEXT: ret			; CHECK-NEXT: ret
	%cr = call i32 @bcmp(ptr %a, ptr %b, i64 15)			%cr = call i32 @bcmp(ptr %a, ptr %b, i64 15)
	%r = icmp eq i32 %cr, 0			%r = icmp eq i32 %cr, 0
	ret i1 %r			ret i1 %r
	}			}

	define i1 @bcmp16(ptr %a, ptr %b) {			define i1 @bcmp16(ptr %a, ptr %b) {
	; CHECK-LABEL: bcmp16:			; CHECK-LABEL: bcmp16:
	; CHECK: // %bb.0:			; CHECK: // %bb.0:
	; CHECK-NEXT: ldp x8, x9, [x0]			; CHECK-NEXT: ldp x8, x9, [x0]
	; CHECK-NEXT: ldp x10, x11, [x1]			; CHECK-NEXT: ldp x10, x11, [x1]
	; CHECK-NEXT: eor x8, x8, x10			; CHECK-NEXT: cmp x8, x10
	; CHECK-NEXT: eor x9, x9, x11			; CHECK-NEXT: ccmp x9, x11, #0, eq
	; CHECK-NEXT: orr x8, x8, x9
	; CHECK-NEXT: cmp x8, #0
	; CHECK-NEXT: cset w0, eq			; CHECK-NEXT: cset w0, eq
	; CHECK-NEXT: ret			; CHECK-NEXT: ret
	%cr = call i32 @bcmp(ptr %a, ptr %b, i64 16)			%cr = call i32 @bcmp(ptr %a, ptr %b, i64 16)
	%r = icmp eq i32 %cr, 0			%r = icmp eq i32 %cr, 0
	ret i1 %r			ret i1 %r
	}			}

	define i1 @bcmp20(ptr %a, ptr %b) {			define i1 @bcmp20(ptr %a, ptr %b) {
	▲ Show 20 Lines • Show All 197 Lines • Show Last 20 Lines

llvm/test/CodeGen/AArch64/dag-combine-setcc.ll

	Show First 20 Lines • Show All 122 Lines • ▼ Show 20 Lines
	; CHECK-NEXT: fmov w8, s0			; CHECK-NEXT: fmov w8, s0
	; CHECK-NEXT: and w0, w8, #0x1			; CHECK-NEXT: and w0, w8, #0x1
	; CHECK-NEXT: ret			; CHECK-NEXT: ret
	%cmp1 = icmp ne <64 x i8> %a, zeroinitializer			%cmp1 = icmp ne <64 x i8> %a, zeroinitializer
	%cast = bitcast <64 x i1> %cmp1 to i64			%cast = bitcast <64 x i1> %cmp1 to i64
	%cmp2 = icmp ne i64 %cast, zeroinitializer			%cmp2 = icmp ne i64 %cast, zeroinitializer
	ret i1 %cmp2			ret i1 %cmp2
	}			}

				define i1 @combine_setcc_eq0_conjunction_xor_or(ptr %a, ptr %b) {
				; CHECK-LABEL: combine_setcc_eq0_conjunction_xor_or:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ldp x8, x9, [x0]
				; CHECK-NEXT: ldp x10, x11, [x1]
				; CHECK-NEXT: cmp x8, x10
				; CHECK-NEXT: ccmp x9, x11, #0, eq
				; CHECK-NEXT: cset w0, eq
				; CHECK-NEXT: ret
				%bcmp = tail call i32 @bcmp(ptr dereferenceable(16) %a, ptr dereferenceable(16) %b, i64 16)
				%cmp = icmp eq i32 %bcmp, 0
				ret i1 %cmp
				}

				define i1 @combine_setcc_ne0_conjunction_xor_or(ptr %a, ptr %b) {
				; CHECK-LABEL: combine_setcc_ne0_conjunction_xor_or:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ldp x8, x9, [x0]
				; CHECK-NEXT: ldp x10, x11, [x1]
				; CHECK-NEXT: cmp x8, x10
				; CHECK-NEXT: ccmp x9, x11, #0, eq
				; CHECK-NEXT: cset w0, ne
				; CHECK-NEXT: ret
				%bcmp = tail call i32 @bcmp(ptr dereferenceable(16) %a, ptr dereferenceable(16) %b, i64 16)
				%cmp = icmp ne i32 %bcmp, 0
				ret i1 %cmp
				}

				; Doesn't increase the number of instructions, where the LHS has multiple uses
				define i32 @combine_setcc_multiuse(i32 %0, i32 %1, i32 %2, i32 %3) {
				; CHECK-LABEL: combine_setcc_multiuse:
				; CHECK: // %bb.0:
				; CHECK-NEXT: eor w8, w1, w0
				; CHECK-NEXT: eor w9, w3, w2
				; CHECK-NEXT: orr w8, w9, w8
				; CHECK-NEXT: cbz w8, .LBB10_2
				; CHECK-NEXT: // %bb.1:
				; CHECK-NEXT: mov w0, w8
				; CHECK-NEXT: b use
				; CHECK-NEXT: .LBB10_2:
				; CHECK-NEXT: ret
				%5 = xor i32 %1, %0
				%6 = xor i32 %3, %2
				%7 = or i32 %6, %5
				%8 = icmp eq i32 %7, 0
				br i1 %8, label %11, label %9

				9: ; preds = %4
				%10 = tail call i32 @use(i32 %7) #2
				br label %11

				11: ; preds = %4, %9
				%12 = phi i32 [ %10, %9 ], [ %0, %4 ]
				ret i32 %12
				}

				; There may be issues with the CMP/CCMP with the scheduling of instructions
				; that ISel will create out of the DAG
				define i32 @combine_setcc_glue(i128 noundef %x, i128 noundef %y) {
				; CHECK-LABEL: combine_setcc_glue:
				; CHECK: // %bb.0: // %entry
				; CHECK-NEXT: cmp x0, x2
				; CHECK-NEXT: cset w8, eq
				; CHECK-NEXT: ccmp x1, x3, #0, eq
				; CHECK-NEXT: cset w9, eq
				; CHECK-NEXT: orr w0, w9, w8
				; CHECK-NEXT: ret
				entry:
				%cmp3 = icmp eq i128 %x, %y
				%conv = trunc i128 %x to i64
				%conv1 = trunc i128 %y to i64
				%cmp = icmp eq i64 %conv, %conv1
				%or7 = or i1 %cmp3, %cmp
				%or = zext i1 %or7 to i32
				ret i32 %or
				}

				; Reduced test from https://github.com/llvm/llvm-project/issues/58675
				define [2 x i64] @PR58675(i128 %a.addr, i128 %b.addr) {
				; CHECK-LABEL: PR58675:
				; CHECK: // %bb.0: // %entry
				; CHECK-NEXT: mov x8, xzr
				; CHECK-NEXT: mov x9, xzr
				; CHECK-NEXT: .LBB12_1: // %do.body
				; CHECK-NEXT: // =>This Inner Loop Header: Depth=1
				; CHECK-NEXT: cmp x0, x8
				; CHECK-NEXT: csel x10, x0, x8, lo
				; CHECK-NEXT: cmp x1, x9
				; CHECK-NEXT: csel x8, x0, x8, lo
				; CHECK-NEXT: csel x8, x10, x8, eq
				; CHECK-NEXT: csel x10, x1, x9, lo
				; CHECK-NEXT: subs x8, x2, x8
				; CHECK-NEXT: sbc x9, x3, x10
				; CHECK-NEXT: ccmp x3, x10, #0, eq
				; CHECK-NEXT: b.ne .LBB12_1
				; CHECK-NEXT: // %bb.2: // %do.end
				; CHECK-NEXT: mov x0, xzr
				; CHECK-NEXT: mov x1, xzr
				; CHECK-NEXT: ret
				entry:
				br label %do.body

				do.body: ; preds = %do.body, %entry
				%a.addr.i1 = phi i128 [ 1, %do.body ], [ 0, %entry ]
				%b.addr.i2 = phi i128 [ %sub, %do.body ], [ 0, %entry ]
				%0 = tail call i128 @llvm.umin.i128(i128 %a.addr, i128 %b.addr.i2)
				%1 = tail call i128 @llvm.umax.i128(i128 0, i128 %a.addr)
				%sub = sub i128 %b.addr, %0
				%cmp18.not = icmp eq i128 %b.addr, %0
				br i1 %cmp18.not, label %do.end, label %do.body

				do.end: ; preds = %do.body
				ret [2 x i64] zeroinitializer
				}

				declare i128 @llvm.umin.i128(i128, i128)
				declare i128 @llvm.umax.i128(i128, i128)
				declare i32 @bcmp(ptr nocapture, ptr nocapture, i64)
				declare i32 @use(i32 noundef)

llvm/test/CodeGen/AArch64/i128-cmp.ll

	; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
	; RUN: llc -mtriple=aarch64-uknown-uknown -verify-machineinstrs -o - %s | FileCheck %s			; RUN: llc -mtriple=aarch64-uknown-uknown -verify-machineinstrs -o - %s | FileCheck %s

	declare void @call()			declare void @call()

	define i1 @cmp_i128_eq(i128 %a, i128 %b) {			define i1 @cmp_i128_eq(i128 %a, i128 %b) {
	; CHECK-LABEL: cmp_i128_eq:			; CHECK-LABEL: cmp_i128_eq:
	; CHECK: // %bb.0:			; CHECK: // %bb.0:
	; CHECK-NEXT: eor x8, x1, x3			; CHECK-NEXT: cmp x0, x2
	; CHECK-NEXT: eor x9, x0, x2			; CHECK-NEXT: ccmp x1, x3, #0, eq
	; CHECK-NEXT: orr x8, x9, x8
	; CHECK-NEXT: cmp x8, #0
	; CHECK-NEXT: cset w0, eq			; CHECK-NEXT: cset w0, eq
	; CHECK-NEXT: ret			; CHECK-NEXT: ret
	%cmp = icmp eq i128 %a, %b			%cmp = icmp eq i128 %a, %b
				dmgreen: Are you sure this is correct? It doesn't look right. I think I would expect `ccmp #0, eq; cset eq`.
				Allen (author): Yes, I tested the executable for the initial case in https://github.com/llvm/llvm-project/issues/58061
				dmgreen: How did you test that? Is it the code from https://gcc.godbolt.org/z/Tv1YP6bPc? Could it have been constant-folded away?
				Allen (author): Yes, the test was very simple: I just ran the following command with and without the changes. ~/llvm-project-upstream/build/bin/clang -march=armv8.2-a -O3 run.c -ffast-math;./a.out
				dmgreen: Have you tried with -fno-inline? The example godbolt link has everything inlined into main, and constant folded into the arguments of the printf's. It doesn't look like it is really testing the codegen. I would expect the code to be the same as this for eq, due to the way i128 is split into i64 registers: https://godbolt.org/z/aed4bYn6n
				Allen (author): Oh, thanks very much, you are right. But how can I get the #8 for the cmp_i128_ne case? It seems the input Code would have to be MI or LT for getNZCVToSatisfyCondCode to return the expected "#8"?
				dmgreen: Yeah, #8 is not the only choice. I think it can be any constant where the Z bit is clear, so #0 is fine (as is #8). #4 is not, because that is the Z bit. Because the eq is a && and the ne is a ||, it might be simpler to just use a constant of 0 in both cases, providing that works.
				Allen (author): Applied your comment; thanks for the guidance.
	ret i1 %cmp			ret i1 %cmp
	}			}

	define i1 @cmp_i128_ne(i128 %a, i128 %b) {			define i1 @cmp_i128_ne(i128 %a, i128 %b) {
	; CHECK-LABEL: cmp_i128_ne:			; CHECK-LABEL: cmp_i128_ne:
	; CHECK: // %bb.0:			; CHECK: // %bb.0:
	; CHECK-NEXT: eor x8, x1, x3			; CHECK-NEXT: cmp x0, x2
	; CHECK-NEXT: eor x9, x0, x2			; CHECK-NEXT: ccmp x1, x3, #0, eq
	; CHECK-NEXT: orr x8, x9, x8
	; CHECK-NEXT: cmp x8, #0
	; CHECK-NEXT: cset w0, ne			; CHECK-NEXT: cset w0, ne
	; CHECK-NEXT: ret			; CHECK-NEXT: ret
	%cmp = icmp ne i128 %a, %b			%cmp = icmp ne i128 %a, %b
				dmgreen: And here it needs to set based on ne, so maybe `ccmp #8, eq; cmp ne`. Those two verify as equivalent.
				Allen (author): The cases can also be found in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104611
	ret i1 %cmp			ret i1 %cmp
	}			}
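The NZCV discussion in the inline comments above can be checked with a small model. The following is a minimal sketch in Python, not LLVM code: the helper names (`cmp_flags`, `ccmp_eq`, `i128_eq`) are hypothetical, and the model assumes the usual AArch64 immediate layout in which bit 3 is N, bit 2 is Z, bit 1 is C and bit 0 is V. It models `cmp lo_a, lo_b; ccmp hi_a, hi_b, #imm, eq; cset w0, eq` on an i128 split into two 64-bit halves.

```python
MASK64 = (1 << 64) - 1
Z_BIT = 0b0100  # Z is bit 2 of the 4-bit NZCV value (assumed layout: N Z C V)


def cmp_flags(a, b):
    """NZCV produced by `cmp a, b` (a - b) on 64-bit registers."""
    a &= MASK64
    b &= MASK64
    diff = (a - b) & MASK64
    n = diff >> 63
    z = int(diff == 0)
    c = int(a >= b)  # carry set means no borrow occurred
    v = int((a >> 63) != (b >> 63) and (diff >> 63) != (a >> 63))
    return (n << 3) | (z << 2) | (c << 1) | v


def ccmp_eq(flags, a, b, nzcv):
    """`ccmp a, b, #nzcv, eq`: compare only if Z is set, else load the immediate."""
    return cmp_flags(a, b) if flags & Z_BIT else nzcv


def i128_eq(lo_a, hi_a, lo_b, hi_b, nzcv=0):
    """Model of `cmp lo; ccmp hi, #nzcv, eq; cset w0, eq` for a split i128."""
    flags = ccmp_eq(cmp_flags(lo_a, lo_b), hi_a, hi_b, nzcv)
    return bool(flags & Z_BIT)
```

With an immediate of `#0` or `#8` the Z bit stays clear whenever the low halves differ, so the final `eq`/`ne` test reflects full 128-bit equality. With `#4` the Z bit is forced on in that case, so `i128_eq(1, 0, 2, 0, nzcv=4)` wrongly reports equality.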

	define i1 @cmp_i128_ugt(i128 %a, i128 %b) {			define i1 @cmp_i128_ugt(i128 %a, i128 %b) {
	; CHECK-LABEL: cmp_i128_ugt:			; CHECK-LABEL: cmp_i128_ugt:
	; CHECK: // %bb.0:			; CHECK: // %bb.0:
	; CHECK-NEXT: cmp x2, x0			; CHECK-NEXT: cmp x2, x0
	; CHECK-NEXT: sbcs xzr, x3, x1			; CHECK-NEXT: sbcs xzr, x3, x1
	▲ Show 20 Lines • Show All 78 Lines • ▼ Show 20 Lines
	; CHECK-NEXT: ret			; CHECK-NEXT: ret
	%cmp = icmp sle i128 %a, %b			%cmp = icmp sle i128 %a, %b
	ret i1 %cmp			ret i1 %cmp
	}			}

	define void @br_on_cmp_i128_eq(i128 %a, i128 %b) nounwind {			define void @br_on_cmp_i128_eq(i128 %a, i128 %b) nounwind {
	; CHECK-LABEL: br_on_cmp_i128_eq:			; CHECK-LABEL: br_on_cmp_i128_eq:
	; CHECK: // %bb.0:			; CHECK: // %bb.0:
	; CHECK-NEXT: eor x8, x1, x3			; CHECK-NEXT: cmp x0, x2
	; CHECK-NEXT: eor x9, x0, x2			; CHECK-NEXT: ccmp x1, x3, #0, eq
	; CHECK-NEXT: orr x8, x9, x8			; CHECK-NEXT: b.ne .LBB10_2
				efriedma: Is there some reason we don't want to combine this to cmp+ccmp+b.ne?
				Allen (author): Thanks for your attention. This case is blocked by the constraint N->use_begin()->getOpcode() != ISD::BRCOND, because I can't confirm that this scenario necessarily benefits; see, for example, test_rmw_add_128 in CodeGen/AArch64/atomicrmw-O0.ll. If we can ignore the regression at -O0, I can relax this constraint.
				SelectionDAG has 19 nodes:
				  t0: ch,glue = EntryToken
				  t2: i64,ch = CopyFromReg t0, Register:i64 %0
				  t6: i64,ch = CopyFromReg t0, Register:i64 %2
				  t26: i64 = xor t2, t6
				  t4: i64,ch = CopyFromReg t0, Register:i64 %1
				  t8: i64,ch = CopyFromReg t0, Register:i64 %3
				  t27: i64 = xor t4, t8
				  t28: i64 = or t26, t27
				  t22: i32 = setcc t28, Constant:i64<0>, setne:ch
				  t21: ch = brcond t0, t22, BasicBlock:ch<exit 0xaaaab28f7268>
				  t18: ch = br t21, BasicBlock:ch<call 0xaaaab28f7170>
				This is the key change in test_rmw_add_128, which is compiled with -O0:
				-; NOLSE-NEXT: eor x11, x9, x11
				-; NOLSE-NEXT: eor x8, x10, x8
				-; NOLSE-NEXT: orr x8, x8, x11
				+; NOLSE-NEXT: mov x9, x8
				 ; NOLSE-NEXT: str x9, [sp, #8] // 8-byte Folded Spill
				+; NOLSE-NEXT: mov x10, x12
				 ; NOLSE-NEXT: str x10, [sp, #16] // 8-byte Folded Spill
				+; NOLSE-NEXT: subs x12, x12, x13
				+; NOLSE-NEXT: ccmp x8, x11, #0, eq
				+; NOLSE-NEXT: cset w8, eq
				 ; NOLSE-NEXT: str x10, [sp, #32] // 8-byte Folded Spill
				 ; NOLSE-NEXT: str x9, [sp, #40] // 8-byte Folded Spill
				-; NOLSE-NEXT: cbnz x8, .LBB4_1
				+; NOLSE-NEXT: tbnz w8, #0, .LBB4_1
				efriedma: We can mostly ignore codesize at -O0. (I mean, it matters to the extent that really bloated code can start to impact compile-time, but that isn't relevant here.)
				Allen (author): Done. Thank you for your guidance.
	; CHECK-NEXT: cbnz x8, .LBB10_2
	; CHECK-NEXT: // %bb.1: // %call			; CHECK-NEXT: // %bb.1: // %call
	; CHECK-NEXT: str x30, [sp, #-16]! // 8-byte Folded Spill			; CHECK-NEXT: str x30, [sp, #-16]! // 8-byte Folded Spill
	; CHECK-NEXT: bl call			; CHECK-NEXT: bl call
	; CHECK-NEXT: ldr x30, [sp], #16 // 8-byte Folded Reload			; CHECK-NEXT: ldr x30, [sp], #16 // 8-byte Folded Reload
	; CHECK-NEXT: .LBB10_2: // %exit			; CHECK-NEXT: .LBB10_2: // %exit
	; CHECK-NEXT: ret			; CHECK-NEXT: ret
	%cmp = icmp eq i128 %a, %b			%cmp = icmp eq i128 %a, %b
	br i1 %cmp, label %call, label %exit			br i1 %cmp, label %call, label %exit
	call:			call:
	call void @call()			call void @call()
	br label %exit			br label %exit
	exit:			exit:
	ret void			ret void
	}			}
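The DAG shape quoted in the inline comments, a `setcc` against zero whose operand is an `or` of `xor` nodes, can be illustrated with a toy matcher. This is a sketch in Python under stated assumptions, not the code in AArch64ISelLowering.cpp: `Node`, `collect_xor_pairs` and `lower_setcc_zero` are hypothetical names, xor operands are assumed to be plain registers, the result is always placed in `w0`, and the one-use and BRCOND checks discussed in the review are left out. The toy accepts an or-tree with any number of xor leaves; whether the actual patch does the same is not shown on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    op: str              # "reg", "xor" or "or" (toy opcodes, not ISD opcodes)
    operands: tuple = ()
    name: str = ""       # register name, used only when op == "reg"


def reg(name):
    return Node("reg", name=name)


def xor(a, b):
    return Node("xor", (a, b))


def or_(a, b):
    return Node("or", (a, b))


def collect_xor_pairs(node):
    """Return [(lhs, rhs), ...] if node is an or-tree of xor leaves, else None."""
    if node.op == "xor":
        return [node.operands]
    if node.op == "or":
        pairs = []
        for operand in node.operands:
            sub = collect_xor_pairs(operand)
            if sub is None:
                return None
            pairs += sub
        return pairs
    return None


def lower_setcc_zero(node, cond):
    """Rewrite `setcc node, 0, cond` into a cmp/ccmp chain, or return None."""
    pairs = collect_xor_pairs(node)
    if pairs is None or len(pairs) < 2:
        return None  # pattern does not apply; keep the eor/orr form
    (a, b), *rest = pairs
    asm = [f"cmp {a.name}, {b.name}"]
    asm += [f"ccmp {x.name}, {y.name}, #0, eq" for x, y in rest]
    asm.append(f"cset w0, {cond}")
    return asm
```

For example, `lower_setcc_zero(or_(xor(reg("x0"), reg("x2")), xor(reg("x1"), reg("x3"))), "ne")` yields the same three-instruction sequence as the new CHECK lines for cmp_i128_ne, while a plain `or` of two registers is left alone.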

	define void @br_on_cmp_i128_ne(i128 %a, i128 %b) nounwind {			define void @br_on_cmp_i128_ne(i128 %a, i128 %b) nounwind {
	; CHECK-LABEL: br_on_cmp_i128_ne:			; CHECK-LABEL: br_on_cmp_i128_ne:
	; CHECK: // %bb.0:			; CHECK: // %bb.0:
	; CHECK-NEXT: eor x8, x1, x3			; CHECK-NEXT: cmp x0, x2
	; CHECK-NEXT: eor x9, x0, x2			; CHECK-NEXT: ccmp x1, x3, #0, eq
	; CHECK-NEXT: orr x8, x9, x8			; CHECK-NEXT: b.eq .LBB11_2
	; CHECK-NEXT: cbz x8, .LBB11_2
	; CHECK-NEXT: // %bb.1: // %call			; CHECK-NEXT: // %bb.1: // %call
	; CHECK-NEXT: str x30, [sp, #-16]! // 8-byte Folded Spill			; CHECK-NEXT: str x30, [sp, #-16]! // 8-byte Folded Spill
	; CHECK-NEXT: bl call			; CHECK-NEXT: bl call
	; CHECK-NEXT: ldr x30, [sp], #16 // 8-byte Folded Reload			; CHECK-NEXT: ldr x30, [sp], #16 // 8-byte Folded Reload
	; CHECK-NEXT: .LBB11_2: // %exit			; CHECK-NEXT: .LBB11_2: // %exit
	; CHECK-NEXT: ret			; CHECK-NEXT: ret
	%cmp = icmp ne i128 %a, %b			%cmp = icmp ne i128 %a, %b
	br i1 %cmp, label %call, label %exit			br i1 %cmp, label %call, label %exit
	▲ Show 20 Lines • Show All 175 Lines • Show Last 20 Lines

llvm/test/CodeGen/AArch64/umulo-128-legalisation-lowering.ll

	Show First 20 Lines • Show All 62 Lines • ▼ Show 20 Lines
	; AARCH-NEXT: cinc x11, x14, hs			; AARCH-NEXT: cinc x11, x14, hs
	; AARCH-NEXT: mul x0, x0, x2			; AARCH-NEXT: mul x0, x0, x2
	; AARCH-NEXT: adds x11, x13, x11			; AARCH-NEXT: adds x11, x13, x11
	; AARCH-NEXT: umulh x13, x8, x3			; AARCH-NEXT: umulh x13, x8, x3
	; AARCH-NEXT: cset w14, hs			; AARCH-NEXT: cset w14, hs
	; AARCH-NEXT: adds x11, x12, x11			; AARCH-NEXT: adds x11, x12, x11
	; AARCH-NEXT: adc x12, x13, x14			; AARCH-NEXT: adc x12, x13, x14
	; AARCH-NEXT: adds x10, x11, x10			; AARCH-NEXT: adds x10, x11, x10
	; AARCH-NEXT: adc x9, x12, x9
	; AARCH-NEXT: asr x11, x1, #63			; AARCH-NEXT: asr x11, x1, #63
	; AARCH-NEXT: eor x9, x9, x11			; AARCH-NEXT: adc x9, x12, x9
	; AARCH-NEXT: eor x10, x10, x11			; AARCH-NEXT: cmp x10, x11
	; AARCH-NEXT: orr x9, x10, x9			; AARCH-NEXT: ccmp x9, x11, #0, eq
	; AARCH-NEXT: cmp x9, #0
	; AARCH-NEXT: cset w9, ne			; AARCH-NEXT: cset w9, ne
	; AARCH-NEXT: tbz x8, #63, .LBB1_2			; AARCH-NEXT: tbz x8, #63, .LBB1_2
	; AARCH-NEXT: // %bb.1: // %Entry			; AARCH-NEXT: // %bb.1: // %Entry
	; AARCH-NEXT: eor x8, x3, #0x8000000000000000			; AARCH-NEXT: eor x8, x3, #0x8000000000000000
	; AARCH-NEXT: orr x8, x2, x8			; AARCH-NEXT: orr x8, x2, x8
	; AARCH-NEXT: cbz x8, .LBB1_3			; AARCH-NEXT: cbz x8, .LBB1_3
	; AARCH-NEXT: .LBB1_2: // %Else2			; AARCH-NEXT: .LBB1_2: // %Else2
	; AARCH-NEXT: cbz w9, .LBB1_4			; AARCH-NEXT: cbz w9, .LBB1_4
	Show All 33 Lines


Diff 471760
