This is an archive of the discontinued LLVM Phabricator instance.

[InstCombine] recognize popcount implemented in hacker's delight.
ClosedPublic

Authored by shchenz on Sep 29 2019, 12:56 AM.

Details

Summary

Try to recognize below popcount implemented in hacker's delight:

int popcount32(unsigned i) {
  i = i - ((i >> 1) & 0x55555555);
  i = (i & 0x33333333) + ((i >> 2) & 0x33333333);
  i = ((i + (i >> 4)) & 0x0F0F0F0F);
  return (i * 0x01010101) >> 24; 
}

This helps platforms which support harware popcount instruction(eg: PowerPC) get some gain for benchmark deepsjeng of cpu2017.

Diff Detail

Event Timeline

shchenz created this revision.Sep 29 2019, 12:56 AM

I think this should to into AgressiveInstCombine.

shchenz updated this revision to Diff 222317.Sep 29 2019, 2:22 AM

address @lebedev.ri comment

Maybe we need to add TargetTransformInfo in InstCombiner to make sure this combination only happens when getPopcntSupport is true.

Atleast on X86 it is not needed, since popcount expander expands intrinsic to exactly this pattern.

https://godbolt.org/z/i-rswr

This comment was removed by xbolva00.

Maybe we need to add TargetTransformInfo in InstCombiner to make sure this combination only happens when getPopcntSupport is true.

Atleast on X86 it is not needed, since popcount expander expands intrinsic to exactly this pattern.

https://godbolt.org/z/i-rswr

Thanks, yes it should be ok on powerpc and X86. But I am afraid there are some platforms which do not have hardware popcount, and we folding many arithmetical operations into one intrinsic, thus we may miss some combination/canonicalization opportunities for these arithmetical operations on such platform. I am not sure ^-^

Maybe we need to add TargetTransformInfo in InstCombiner to make sure this combination only happens when getPopcntSupport is true.

Atleast on X86 it is not needed, since popcount expander expands intrinsic to exactly this pattern.

https://godbolt.org/z/i-rswr

Thanks, yes it should be ok on powerpc and X86. But I am afraid there are some platforms which do not have hardware popcount, and we folding many arithmetical operations into one intrinsic, thus we may miss some combination/canonicalization opportunities for these arithmetical operations on such platform. I am not sure ^-^

No. This is a canonicalization pass. Native LLVM IR intrinsic is more canonical than some IR blob.

Please add m_OneUse everywhere.

xbolva00 added inline comments.Sep 30 2019, 5:39 AM
llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
297

Recognized

Please add m_OneUse everywhere.

@xbolva00 Do we need to add m_OneUse for every operation? If some instruction like first lshr (%2 = lshr i32 %0, 1) has other use, it may still a win by doing this transformation.
Here if we do not consider m_OneUse issue, the worst case is: all instructions have other uses, and we can only replace final lshr (%13 = lshr i32 %12, 24) with popcount intrinsic. One instruction vs One intrinsic, I think it may be not a lose?

xbolva00 added a comment.EditedSep 30 2019, 7:21 AM

I meant something like in foldAnyOrAllBitsSet. Final lshr is ok to have multiple uses, all partial computations should be m_Oneuse.

I meant something like in foldAnyOrAllBitsSet. Final lshr is ok to have multiple uses, all partial computations should be m_Oneuse.

I don't think that's necessary for this transform. The difference is we are only creating a single instruction for this transform. Therefore, we can always replace the final instruction in the sequence with some other instruction without producing any extra instructions overall.

spatel added inline comments.Sep 30 2019, 7:52 AM
llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
269–271

This might actually be easier to read if generalized for more bitwidths. Can we use APInt and handle power-of-2 sizes from 8- to 128-bit?
http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel

spatel added inline comments.Sep 30 2019, 8:01 AM
llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
269–271

For reference, we know it's safe to do that transform more generally because the backend handles the expansion back to this pattern more generally:
https://github.com/llvm-mirror/llvm/blob/master/lib/CodeGen/SelectionDAG/TargetLowering.cpp#L5923

(That expansion is why it is safe to do this transform in IR without a TTI hook in the first place; we don't expect any regressions because we should reverse the transform.)

jsji added a subscriber: ppc-slack.Sep 30 2019, 1:14 PM
xbolva00 added inline comments.Oct 1 2019, 6:52 AM
llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
270

Type size - 8?

vector support?

lebedev.ri requested changes to this revision.Oct 6 2019, 11:01 AM
lebedev.ri added a reviewer: kparzysz.
lebedev.ri added a subscriber: kparzysz.

I agree it would be best to generalize this.
There was a previous patch that already did that (D45173) and it was generic IIRC, but it got stuck and is now under wrong license.
Maybe @kparzysz is willing to relicense it thought?

This revision now requires changes to proceed.Oct 6 2019, 11:01 AM

Yes, D45173 recognizes some standards forms of popcount, but it can not recognize the one in benchmark deepsjeng. There are many forms of popcount, this one is TargetLowering::expandCTPOP choose to expand. So I guess we should add some specific combination code to recognize it?

shchenz updated this revision to Diff 223801.Oct 8 2019, 1:50 AM

address comments.

shchenz marked 3 inline comments as done.Oct 8 2019, 1:55 AM
shchenz added inline comments.
llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
270

When type size is 8, we don't need the final lshr, so we can not do it in instcombine for opcode lshr, but we can still do it based on final mul.
Currently I left it as a follow-up issue until a real world case found. Hope this is ok.

Thanks, new patch looks great.

shchenz updated this revision to Diff 223992.Oct 8 2019, 11:51 PM

minor fix for testcase

xbolva00 accepted this revision.Oct 9 2019, 1:39 PM

Ok for me

craig.topper added inline comments.
llvm/include/llvm/IR/PatternMatch.h
664–670

Can we std::move this into specific_intval constructor and then std::move it again in the class. Otherwise we're making multiple heap allocations whenever the value is more the 64 bits.

shchenz updated this revision to Diff 224234.Oct 9 2019, 6:48 PM
shchenz marked an inline comment as done.

avoid unnescessary heap allocations in APInt copy constructor.

shchenz edited the summary of this revision. (Show Details)Oct 9 2019, 6:50 PM
shchenz added a reviewer: craig.topper.
shchenz added inline comments.
llvm/include/llvm/IR/PatternMatch.h
664–670

Right. Thanks for pointing it out.

craig.topper added inline comments.Oct 10 2019, 10:14 PM
llvm/test/Transforms/AggressiveInstCombine/popcount.ll
3

Why does this run the entire -O3 pipeline and not just the aggressive instcombine pass?

This revision was not accepted when it landed; it landed in state Needs Review.Oct 10 2019, 10:14 PM
This revision was automatically updated to reflect the committed changes.
shchenz marked an inline comment as done.Oct 10 2019, 10:28 PM
shchenz added inline comments.
llvm/test/Transforms/AggressiveInstCombine/popcount.ll
3

Thanks, done in rL374514