The current code uses an xor+or pattern to expand memcmp, which is not efficient on AArch64.
This patch adds a new option, PreferCmpToExpand, in MemCmpExpansionOptions and uses a cmp+or pattern for the expansion instead (a sketch of the two shapes is included below).
Fix: #56543
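
For illustration, here is a minimal standalone sketch of the two expansion shapes for bcmp(a, b, 3) == 0, assuming the expansion uses one 2-byte and one 1-byte load; the load16/load8 helpers and the exact load split are assumptions of this sketch, not code from the patch.

  #include <cstdint>
  #include <cstring>

  // Hypothetical helpers for this sketch: unaligned 2-byte and 1-byte loads.
  static uint32_t load16(const unsigned char *P) { uint16_t V; std::memcpy(&V, P, 2); return V; }
  static uint32_t load8(const unsigned char *P) { return *P; }

  // Current xor+or shape: differences are collected with xor, merged with or,
  // and tested against zero once at the end.
  bool eqXorOr(const unsigned char *A, const unsigned char *B) {
    uint32_t Diff = (load16(A) ^ load16(B)) | (load8(A + 2) ^ load8(B + 2));
    return Diff == 0;
  }

  // cmp+or shape requested by PreferCmpToExpand: each chunk is compared
  // directly and the boolean results are or'ed together.
  bool eqCmpOr(const unsigned char *A, const unsigned char *B) {
    bool Diff0 = load16(A) != load16(B);
    bool Diff1 = load8(A + 2) != load8(B + 2);
    return !(Diff0 | Diff1);
  }

The intent, per the discussion below, is that the second shape is easier for the AArch64 backend to turn into a cmp/ccmp sequence than the or-of-xors.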
Differential D136672
[ExpandMemCmp][AArch64] Add a new option PreferCmpToExpand in MemCmpExpansionOptions and enable on AArch64
Abandoned · Public · Authored by bcl5980 on Oct 25 2022, 2:53 AM
Event Timeline
Is this the same as https://reviews.llvm.org/D136244? Looks like there are two tickets for the same issue. They both solve it differently. D136244 looks like it is more general - handling more cases than just bcmp. But this might optimize larger bcmps? I think I would expect them to become a series of cmp; ccmp; ccmp; ccmp when optimized properly, though.
I have rebased onto code that already includes D136244. These two tickets look the same, but it seems D136244 can't fix the original 3-byte bcmp case:
(gdb) p LHS.getOperand(0)->getOpcode() == ISD::XOR && LHS.getOperand(1)->getOpcode() == ISD::XOR
$16 = false
(gdb) p LHS.dump()
t20: i32 = or t46, t50
$17 = void
(gdb) p LHS.getOperand(0).dump()
t46: i32 = and t44, Constant:i32<65535>
$18 = void
(gdb) p LHS.getOperand(1).dump()
t50: i32 = and t48, Constant:i32<255>
Of course, we can do it on AArch64. But I guess the pattern-matching code will be more complicated and longer; you need to detect three more patterns (see the sketch after these comments):

or (and (xor a, b), C1), (xor c, d)
or (xor a, b), (and (xor c, d), C2)
or (and (xor a, b), C1), (and (xor c, d), C2)

And you should only consider eq 0 when an and is involved, and the C1 and C2 values also lead to somewhat different results. For now, what I want to discuss is: do we need this patch?

From the looks of the tests, the AND should be optimized out. My guess would be that we don't revisit the setcc after the ands have been simplified, though. It might work better if it was run during lowering.
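
For reference, a rough sketch (under the assumptions above, not code from this patch or from D136244) of what accepting the masked forms listed a couple of comments up might look like; stripMaskedXor and matchMaskedXorPair are hypothetical helper names.

  #include "llvm/CodeGen/SelectionDAGNodes.h"
  using namespace llvm;

  // Accept either (xor a, b) or (and (xor a, b), C); returns the xor value and
  // reports any mask constant through MaskC, or a null SDValue on mismatch.
  static SDValue stripMaskedXor(SDValue V, ConstantSDNode *&MaskC) {
    MaskC = nullptr;
    if (V.getOpcode() == ISD::AND)
      if (auto *C = dyn_cast<ConstantSDNode>(V.getOperand(1))) {
        MaskC = C;
        V = V.getOperand(0);
      }
    return V.getOpcode() == ISD::XOR ? V : SDValue();
  }

  // Intended for a combine on (setcc (or X, Y), 0, eq); as noted above, the
  // masked forms are only valid for the equality-with-zero case, and the
  // caller still has to account for the C1/C2 mask values.
  static bool matchMaskedXorPair(SDValue Or, SDValue &LHSXor, SDValue &RHSXor) {
    if (Or.getOpcode() != ISD::OR)
      return false;
    ConstantSDNode *C1 = nullptr, *C2 = nullptr;
    LHSXor = stripMaskedXor(Or.getOperand(0), C1);
    RHSXor = stripMaskedXor(Or.getOperand(1), C2);
    return LHSXor && RHSXor;
  }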
I think, at the moment, we are heading towards ISel folds to turn the eor patterns into ccmp. If we don't think that method will work (or is too complicated), then this patch would be useful. But it would be more general to fix it in ISel, as that should capture more cases than just bcmp.

I happen to have been looking into https://github.com/llvm/llvm-project/issues/56543 (which I believe is what this is trying to solve), and came across a possible fix from a slightly different angle. It seems LLVM already knows how to optimize to the desired (handwritten) code for that issue, but the patterns (some zero-extend folds) needed to do so are being prevented by this XOR simplification (in DAGCombiner.cpp, visitXOR):

  // Simplify: xor (op x...), (op y...) -> (op (xor x, y))
  if (N0Opcode == N1.getOpcode())
    if (SDValue V = hoistLogicOpWithSameOpcodeHands(N))
      return V;

If you comment out that fold, LLVM will optimize the memcmp as desired. So I think it's just a matter of adding or adjusting some of these zero-extend optimizations to handle xor being simplified like this (a small before/after illustration of that simplification follows below).
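
For readers not familiar with that combine: hoistLogicOpWithSameOpcodeHands rewrites a logic op whose two operands share the same opcode, so here an xor of two zero-extended loads becomes a zero-extend of a narrower xor. A minimal scalar illustration of the before/after shapes (the i8-to-i32 widths are assumed, just to show the identity the hoist relies on):

  #include <cstdint>

  // Shape before the hoist: xor (zext a), (zext b) -- two zext nodes feed the xor.
  uint32_t beforeHoist(uint8_t A, uint8_t B) {
    return uint32_t(A) ^ uint32_t(B);
  }

  // Shape after the hoist: zext (xor a, b) -- a single zext of a narrow xor.
  // Both functions compute the same value; the point is that after the rewrite
  // the zero-extend folds mentioned in the comment above no longer see two
  // separate zext operands, so they don't fire.
  uint32_t afterHoist(uint8_t A, uint8_t B) {
    return uint32_t(uint8_t(A ^ B));
  }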
Revision Contents
Diff 470425
llvm/include/llvm/Analysis/TargetTransformInfo.h
llvm/lib/CodeGen/ExpandMemCmp.cpp
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
llvm/test/CodeGen/AArch64/bcmp-inline-small.ll
llvm/test/CodeGen/AArch64/bcmp.ll