This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Optimize more memcmp when the result is tested for [in]equality with 0
ClosedPublic

Authored by Allen on Nov 9 2022, 8:31 AM.

Details

Summary

We already surpport the or (xor a, b), (xor c, d) with D136244, while it should
capture more cases than just bcmp according the comment on
https://reviews.llvm.org/D136672, so this patch try to fold continuous
comparison series.

Also add a new callsite in LowerSETCC to address some cases folded And in the
stage of Optimized type-legalized selection.

Diff Detail

Event Timeline

Allen created this revision.Nov 9 2022, 8:31 AM
Herald added a project: Restricted Project. · View Herald TranscriptNov 9 2022, 8:31 AM
Allen requested review of this revision.Nov 9 2022, 8:31 AM
Herald added a project: Restricted Project. · View Herald TranscriptNov 9 2022, 8:31 AM

Looks like a good patch. I like the way it checks for chains of ORs.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
19655

The check for OR could be moved below the check for XOR, saving XOR from being checked twice.

19670

Could this be called something like performOrXorChainCombine. I'm not sure I love that name, but the transform is more general than just memcmp. It would be good to mention memcmp/bmp in the comments of the function perhaps, to explain the motivation.

19682

LLVM likes to exit early by inverting the condition. It can help reduce indentations.

19684

Perhaps rename isMemcmpPattern to isOrXorChain too.

19694

i -> I

19711

Could this do:

if (!DCI.isBeforeLegalize())
  if (SDValue V = performMemcmpCombine(N, DAG))
    return V;

It should avoid needing to pass DCI through to performMemcmpCombine.

llvm/test/CodeGen/AArch64/bcmp.ll
43

I think this might be fixed if performMemcmpCombine was called from lowerSETCC as well as from the combine. It looks like the issue is that it we do not call the performMemcmpCombine after the operands further up the tree have been simplified.

145

This And shouldn't be here - it should be pushed higher into the ldrb's. There is some code that already tried to push ands up into loads, but I've not looked into whether it could be extended to handle this too.

Allen updated this revision to Diff 474415.Nov 9 2022, 7:15 PM
Allen marked an inline comment as done.

Address comments

Allen edited the summary of this revision. (Show Details)Nov 9 2022, 7:17 PM
Allen marked 5 inline comments as done.
Allen added inline comments.
llvm/test/CodeGen/AArch64/bcmp.ll
43

Thansk, apply with your commend

145

Ok, I'll try to figure out that with a following patch, add a TODO for this case

bcl5980 added inline comments.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
8571

Do we need add one-use check for or?
Or generate the depth of the node to determine the more than one use node can enable the opt or not?

8592

cmp A0, A1; ccmp B0, B1 ?

bcl5980 added inline comments.Nov 9 2022, 11:15 PM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
8565

Can ISD::SUB be leaf node also?

Allen updated this revision to Diff 474482.Nov 10 2022, 2:05 AM
Allen marked 2 inline comments as done.
Allen added a reviewer: bcl5980.
Allen added inline comments.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
8565

Thanks, But I doesn't sure this need be extended as the bcmp doesn't generate such IR after outlining?

8571

Yes, we can check the one-use to start, thanks.
If more complex scenarios are required in the actual code, then the depth may also need later.

8592

Done, thanks!

bcl5980 added inline comments.Nov 10 2022, 5:11 AM
llvm/test/CodeGen/AArch64/bcmp.ll
415–428

I agree that cmp+ccmp chain is generally better but a little worry about this test case.
cmp chain need 8 cycles to do on every machine.
But 8 xor + 7 or + 1 cmp can run faster on high end cpu. For example a 4 width int alu port machine.
2 cycle for xor
3 cycle for or
1 cycle for cmp
total 6 cycle.

Allen added inline comments.Nov 10 2022, 6:35 AM
llvm/test/CodeGen/AArch64/bcmp.ll
415–428

Good catch. In general, all of the XOR, OR and CMP use ALU ports, so data dependency will become the bottleneck on high end CPU.
If so, an additional parameter is needed to guard the max number of xors ? Or some other suggestion?

bcl5980 added inline comments.Nov 10 2022, 8:28 PM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
8565

I believe the leaf node needn't one-use. It will not increase the instruction count.

llvm/test/CodeGen/AArch64/bcmp.ll
415–428

I'm also not sure if we need a max leaf node limitation. Max size of bcmp expand is 64bytes. So larger size also needn't worry about it.

Allen updated this revision to Diff 474914.Nov 11 2022, 7:23 PM

Delete the one-use of leaf node XOR and add a limitation of xor nodes

Allen marked an inline comment as done.Nov 11 2022, 7:28 PM
Allen added inline comments.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
8565

Done, thanks

llvm/test/CodeGen/AArch64/bcmp.ll
415–428

Thanks, so I add the max limitation number 6 for xors now.
If we can get more schedule model info, we may relex th e condition later.

dmgreen added inline comments.Nov 12 2022, 2:03 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
8591

This code doesn't handle float compares so you shouldn't need the IsStrict stuff. Maybe only call this from LowerSETCC if the Opcode is ISD::SETCC or LHS.getValueType().isInteger().

llvm/test/CodeGen/AArch64/bcmp.ll
415–428

I can see what you mean, but I'm not sure we need to limit this case. In my experience this much of reduction in instruction count can be good for performance on its own, even if it turns the tree into a series of dependant ccmp. We could theoretically have larger trees though, so maybe put the limit to higher?

Unless you have performance results that actually shows it to be worse, we believe it is better on most cpus so I would have it perform the transform in this case. 8 less instructions is probably always worth 1 longer critical patch length.

Allen updated this revision to Diff 474937.Nov 12 2022, 4:19 AM
Allen marked 2 inline comments as done.

Adjust the max limitation 6->16

Allen marked 2 inline comments as done.Nov 12 2022, 4:23 AM
Allen added inline comments.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
8591
  • ok. delete th IsStrict stuff
  • the callsize from performSETCCCombine is still need as brcond+setcc will be combine into br_cc in the stage Optimized type-legalized selection, which is before the Legalized selection DAG, such as case br_on_cmp_i128_ne in file CodeGen/AArch64/i128-cmp.ll
llvm/test/CodeGen/AArch64/bcmp.ll
415–428

Thanks, I adjust the max limitation 6->16, and this case it enable the transform (I don't have a machine with 4 width ALU port to test this case)

There is no GlobalISel code?

Allen marked an inline comment as done.Nov 12 2022, 5:45 AM

There is no GlobalISel code?

I'm not familiar with GlobalISel, but I can try to complement this it with a new independent patch. It would be better if you could direct me which interface I need to focus on:)

dmgreen accepted this revision.Nov 12 2022, 8:53 AM

Thanks for the update. LGTM.

This revision is now accepted and ready to land.Nov 12 2022, 8:53 AM
This revision was landed with ongoing or failed builds.Nov 12 2022, 7:06 PM
This revision was automatically updated to reflect the committed changes.