This is an archive of the discontinued LLVM Phabricator instance.

[ExpandMemCmp][AArch64] Add a new option PreferCmpToExpand in inMemCmpExpansionOptions and enable on AArch64
AbandonedPublic

Authored by bcl5980 on Oct 25 2022, 2:53 AM.

Details

Summary

Current code use xor+or pattern to expand memcmp, but it is not efficient on AArch64.
This patch adds a new option PreferCmpToExpand, use cmp+or pattern to expand.

Fix: #56543

Diff Detail

Event Timeline

bcl5980 created this revision.Oct 25 2022, 2:53 AM
Herald added a project: Restricted Project. · View Herald TranscriptOct 25 2022, 2:53 AM
bcl5980 requested review of this revision.Oct 25 2022, 2:53 AM
Herald added a project: Restricted Project. · View Herald TranscriptOct 25 2022, 2:53 AM
bcl5980 retitled this revision from [ExpandMemCmp][AArch64] Add a new option PreferCmpToExpand in inMemCmpExpansionOptions and enable or AArch64 to [ExpandMemCmp][AArch64] Add a new option PreferCmpToExpand in inMemCmpExpansionOptions and enable on AArch64.Oct 25 2022, 2:53 AM

Is this the same as https://reviews.llvm.org/D136244? Looks like there are two tickets for the same issue.

They both solve it differently. D136244 looks like it is more general - handling more cases then just bcmp. But this might optimize larger bcmp's? I think I would expect them to become a series of cmp;ccmp;ccmp;ccmp when optimized properly though.

bcl5980 updated this revision to Diff 471345.Oct 27 2022, 6:06 PM

rebase code

Is this the same as https://reviews.llvm.org/D136244? Looks like there are two tickets for the same issue.

They both solve it differently. D136244 looks like it is more general - handling more cases then just bcmp. But this might optimize larger bcmp's? I think I would expect them to become a series of cmp;ccmp;ccmp;ccmp when optimized properly though.

I have rebase to the code already include D136244. These two tickets looks the same but it looks D136244 can't fix the original case bcmp 3bytes.
And not only AArch64, maybe this change also help some potential platform.

Allen added a comment.Oct 27 2022, 6:50 PM

I have rebase to the code already include D136244. These two tickets looks the same but it looks D136244 can't fix the original case bcmp 3bytes.
And not only AArch64, maybe this change also help some potential platform.

  • For the case bcmp3, the combine pattern is mismatch as we restrict to ISD::XOR, so the ISD::AND may be enhanced too ?
(gdb) p LHS.getOperand(0)->getOpcode() == ISD::XOR && LHS.getOperand(1)->getOpcode() == ISD::XOR
$16 = false
(gdb) p LHS.dump()
t20: i32 = or t46, t50
$17 = void
(gdb) p LHS.getOperand(0).dump()
t46: i32 = and t44, Constant:i32<65535>
$18 = void
(gdb) p LHS.getOperand(1).dump()
t50: i32 = and t48, Constant:i32<255>
bcl5980 added a comment.EditedOct 27 2022, 7:34 PM

I have rebase to the code already include D136244. These two tickets looks the same but it looks D136244 can't fix the original case bcmp 3bytes.
And not only AArch64, maybe this change also help some potential platform.

  • For the case bcmp3, the combine pattern is mismatch as we restrict to ISD::XOR, so the ISD::AND may be enhanced too ?
(gdb) p LHS.getOperand(0)->getOpcode() == ISD::XOR && LHS.getOperand(1)->getOpcode() == ISD::XOR
$16 = false
(gdb) p LHS.dump()
t20: i32 = or t46, t50
$17 = void
(gdb) p LHS.getOperand(0).dump()
t46: i32 = and t44, Constant:i32<65535>
$18 = void
(gdb) p LHS.getOperand(1).dump()
t50: i32 = and t48, Constant:i32<255>

Of course, we can do it on AArch64. And I guess the code pattern match will be more complicated and long, you need to detect more 3 patterns:

or (and (xor a, b), C1), (xor c, d)
or (xor a, b), (and (xor c, d), C2)
or (and (xor a, b), C1), (and (xor c, d), C2)

And you should only consider eq 0 when and invovle. And C1, C2 value also involve some different result.
I guess there will be some tasks if you want to start.

For now, what I want to discuss is: Do we need this patch?

From the looks of the tests the AND should be optimized out. My guess would be that we don't revisit the setcc after the and have been simplified though. It might work better if it was run during lowering.

For now, what I want to discuss is: Do we need this patch?

I think, at the moment, we are heading towards ISel folds to turn the the eor patterns into ccmp. If we don't think that method will work (or is too complicated), then this patch would be useful. But it should be more general to fix it in ISel, as it should capture more cases than just bcmp.

bcl5980 abandoned this revision.Oct 31 2022, 2:39 AM

I happen to have been looking into https://github.com/llvm/llvm-project/issues/56543 (which I believe is what this is trying to solve), and came across a possible fix from a slightly different angle.

It seems LLVM already knows how to optimize to the desired (handwritten code) for that issue, but the patterns (some zero extend folds) needed to do so are being prevented by this XOR simplification (in DAGCombiner.cpp visitXOR) :

// Simplify: xor (op x...), (op y...)  -> (op (xor x, y))
if (N0Opcode == N1.getOpcode())
  if (SDValue V = hoistLogicOpWithSameOpcodeHands(N))
    return V;

If you comment that fold LLVM will optimize the memcmp as desired.

So I think it's just a matter of adding/or adjusting some of these zero extend optimizations to handle xor being simplified like this.