halves the number of CNT instructions generated
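For context, a reduced test in the spirit of the popcount.ll test under review (a hypothetical sketch, assuming the patch targets the lowering of `llvm.ctpop.i128`; the function name is illustrative):

```llvm
declare i128 @llvm.ctpop.i128(i128)

; i128 popcount on AArch64: the value is moved into a vector register,
; counted bytewise with CNT, and the per-byte counts are summed.
; Presumably each i64 half was previously counted with its own CNT,
; which is the doubling the patch summary refers to.
define i128 @popcount128(i128* nocapture nonnull readonly %ptr) {
  %v = load i128, i128* %ptr, align 16
  %c = tail call i128 @llvm.ctpop.i128(i128 %v)
  ret i128 %c
}
```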
**llvm/test/CodeGen/AArch64/popcount.ll:10**

Why are we generating two loads here? Something related to the BITCAST legalization?
**llvm/test/CodeGen/AArch64/popcount.ll:10**

Yes, it should be:
**llvm/test/CodeGen/AArch64/popcount.ll:10**

Yes, it is going to AArch64ISelLowering.cpp:14006:

```cpp
case ISD::LOAD: {
  assert(SDValue(N, 0).getValueType() == MVT::i128 &&
         "unexpected load's value type");
  LoadSDNode *LoadNode = cast<LoadSDNode>(N);
  if (!LoadNode->isVolatile() || LoadNode->getMemoryVT() != MVT::i128) {
    // Non-volatile loads are optimized later in AArch64's load/store
    // optimizer. // <====== This is not happening
    return;
  }
  SDValue Result = DAG.getMemIntrinsicNode(
      AArch64ISD::LDP, SDLoc(N),
      DAG.getVTList({MVT::i64, MVT::i64, MVT::Other}),
      {LoadNode->getChain(), LoadNode->getBasePtr()}, LoadNode->getMemoryVT(),
      LoadNode->getMemOperand());
  SDValue Pair = DAG.getNode(ISD::BUILD_PAIR, SDLoc(N), MVT::i128,
                             Result.getValue(0), Result.getValue(1));
  Results.append({Pair, Result.getValue(2) /* Chain */});
  return;
}
```
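To make the two paths concrete, a small sketch (hypothetical IR, not part of the review): only the volatile form is expanded to an `AArch64ISD::LDP` node by the case above; the plain load takes the early return and is left to the later load/store optimizer, which, per the inline note, is not merging it here.

```llvm
; Volatile i128 load: hits the DAG.getMemIntrinsicNode(AArch64ISD::LDP, ...)
; path above and is emitted directly as an ldp.
define i128 @load_volatile(i128* %p) {
  %v = load volatile i128, i128* %p, align 16
  ret i128 %v
}

; Non-volatile i128 load: returns early from the hook and relies on
; AArch64's load/store optimizer to merge the two halves later.
define i128 @load_plain(i128* %p) {
  %v = load i128, i128* %p, align 16
  ret i128 %v
}
```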
**llvm/test/CodeGen/AArch64/popcount.ll:10**

With -O0 it outputs:

```
ldr x8, [x0, #8]
ldr d0, [x0]
// implicit-def: $q1
mov v1.16b, v0.16b
mov v1.d[1], x8
```
**llvm/test/CodeGen/AArch64/popcount.ll:10**

Hmm. Really, I think the problem reduces to something like the following, which generates a similar ldr+add+ld1 sequence:

```llvm
define <2 x i64> @z(i64* nocapture nonnull readonly %p) {
  %b = load i64, i64* %p
  %p2 = getelementptr i64, i64* %p, i64 1
  %bb = load i64, i64* %p2
  %r1 = insertelement <2 x i64> zeroinitializer, i64 %b, i32 0
  %r2 = insertelement <2 x i64> %r1, i64 %bb, i32 1
  ret <2 x i64> %r2
}
```

X86ISelLowering.cpp has some code specifically to handle this; see EltsFromConsecutiveLoads. Maybe some of it should be ported to AArch64.
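For illustration, the shape such a port would aim to produce, written at the IR level (a sketch only; EltsFromConsecutiveLoads actually operates on SelectionDAG nodes during lowering, not on IR, and the function name here is invented):

```llvm
define <2 x i64> @z_merged(i64* nocapture nonnull readonly %p) {
  ; The two consecutive i64 loads feeding the insertelements collapse
  ; into one 128-bit vector load, i.e. a single q-register ldr on AArch64.
  %vp = bitcast i64* %p to <2 x i64>*
  %r = load <2 x i64>, <2 x i64>* %vp, align 8
  ret <2 x i64> %r
}
```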
**llvm/test/CodeGen/AArch64/popcount.ll:10**

Why is this not optimized to:

```llvm
define <2 x i64> @z(i64* nocapture nonnull readonly %p) {
  %p128 = bitcast i64* %p to i128*
  %b = load i128, i128* %p128, align 8
  %r = bitcast i128 %b to <2 x i64>
  ret <2 x i64> %r
}
```
LGTM
**llvm/test/CodeGen/AArch64/popcount.ll:10**

IR optimizations of that sort are very limited at the moment. Maybe something to look into for the vectorcombine pass. It might be hard for IR-level optimizations to catch all the interesting cases anyway, though, given how much of shuffle lowering happens in SelectionDAG.