This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] custom lowering for i128 popcount
ClosedPublic

Authored by shawnl on Jun 7 2020, 8:23 AM.

Details

Summary

Halves the number of CNT instructions generated for an i128 popcount.
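
For context, the kind of input this lowering targets looks roughly like the sketch below (illustrative only; the function name and exact shape are not necessarily those of the test in llvm/test/CodeGen/AArch64/popcount.ll). Without custom lowering, the i128 CTPOP is expanded into two i64 popcounts, each lowering to its own CNT; the idea here, presumably, is to treat the value as a single 128-bit vector so one CNT plus a horizontal add suffices.

declare i128 @llvm.ctpop.i128(i128)

; Count the set bits of an i128 loaded from memory.
define i128 @popcount128(i128* nocapture nonnull readonly %p) {
  %v = load i128, i128* %p, align 16
  %c = call i128 @llvm.ctpop.i128(i128 %v)
  ret i128 %c
}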

Diff Detail

Event Timeline

shawnl created this revision.Jun 7 2020, 8:23 AM

(If you're not using Arcanist, please upload patches with full context, -U100000.)

llvm/test/CodeGen/AArch64/popcount.ll
10

Why are we generating two loads here? Something related to the BITCAST legalization?

shawnl added inline comments.Jun 7 2020, 1:09 PM
llvm/test/CodeGen/AArch64/popcount.ll
10

Yes, it should be:

ldr q0, [x0]

shawnl marked an inline comment as done.Jun 7 2020, 2:25 PM
shawnl added inline comments.
llvm/test/CodeGen/AArch64/popcount.ll
10

Yes, it is hitting AArch64ISelLowering.cpp:14006:

case ISD::LOAD: {
  assert(SDValue(N, 0).getValueType() == MVT::i128 &&
         "unexpected load's value type");
  LoadSDNode *LoadNode = cast<LoadSDNode>(N);
  if (!LoadNode->isVolatile() || LoadNode->getMemoryVT() != MVT::i128) {
    // Non-volatile loads are optimized later in AArch64's load/store
    // optimizer.    // <====== This is not happening
    return;
  }

  SDValue Result = DAG.getMemIntrinsicNode(
      AArch64ISD::LDP, SDLoc(N),
      DAG.getVTList({MVT::i64, MVT::i64, MVT::Other}),
      {LoadNode->getChain(), LoadNode->getBasePtr()}, LoadNode->getMemoryVT(),
      LoadNode->getMemOperand());

  SDValue Pair = DAG.getNode(ISD::BUILD_PAIR, SDLoc(N), MVT::i128,
                             Result.getValue(0), Result.getValue(1));
  Results.append({Pair, Result.getValue(2) /* Chain */});
  return;
}
shawnl marked an inline comment as not done.Jun 7 2020, 2:34 PM
shawnl added inline comments.
llvm/test/CodeGen/AArch64/popcount.ll
10

With -O0 it outputs:

	ldr	x8, [x0, #8]
	ldr	d0, [x0]
                                        // implicit-def: $q1
	mov	v1.16b, v0.16b
	mov	v1.d[1], x8
efriedma added inline comments.Jun 7 2020, 8:28 PM
llvm/test/CodeGen/AArch64/popcount.ll
10

Hmm. Really, I think the problem reduces to something like the following, which generates a similar ldr+add+ld1 sequence:

define <2 x i64> @z(i64* nocapture nonnull readonly %p) {
  %b = load i64, i64* %p
  %p2 = getelementptr i64, i64* %p, i64 1
  %bb = load i64, i64* %p2
  %r1 = insertelement <2 x i64> zeroinitializer, i64 %b, i32 0
  %r2 = insertelement <2 x i64> %r1, i64 %bb, i32 1
  ret <2 x i64> %r2
}

X86ISelLowering.cpp has some code specifically to handle this; see EltsFromConsecutiveLoads. Maybe some of it should be ported to AArch64.

shawnl updated this revision to Diff 269133.Jun 8 2020, 2:20 AM
shawnl marked an inline comment as not done.

Update tests so the review is not distracted by an unrelated optimization bug.

shawnl marked an inline comment as done.Jun 8 2020, 2:21 AM
shawnl marked an inline comment as done.Jun 8 2020, 3:32 AM
shawnl added inline comments.
llvm/test/CodeGen/AArch64/popcount.ll
10

Why is this not optimized to:

define <2 x i64> @z(i64* nocapture nonnull readonly %p) {
  %p128 = bitcast i64* %p to i128*
  %b = load i128, i128* %p128, align 8
  %r = bitcast i128 %b to <2 x i64>
  ret <2 x i64> %r
}
efriedma accepted this revision.Jun 8 2020, 2:08 PM

LGTM

llvm/test/CodeGen/AArch64/popcount.ll
10

IR optimizations of that sort are very limited at the moment. Maybe something to look into for the vectorcombine pass.

It might be hard for IR-level optimizations to catch all the interesting cases anyway, though, given how much of shuffle lowering happens in SelectionDAG.
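
Concretely, the fold under discussion would rewrite the reduced example above into something like the following (a hand-written sketch of the vector-load form, not the output of any existing pass; the function name is illustrative):

define <2 x i64> @z.folded(i64* nocapture nonnull readonly %p) {
  ; The two consecutive i64 loads plus insertelements collapse into one
  ; <2 x i64> load; the original scalar loads' alignment (8) is preserved.
  %vp = bitcast i64* %p to <2 x i64>*
  %v = load <2 x i64>, <2 x i64>* %vp, align 8
  ret <2 x i64> %v
}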

This revision is now accepted and ready to land.Jun 8 2020, 2:08 PM
shawnl closed this revision.Jun 25 2020, 1:44 AM

This was committed.