Recognize the pattern for population count in the instruction combiner and replace it with a call to the corresponding intrinsic.
The pattern is based on that in "Hacker's Delight".
Details
- Reviewers
majnemer
Diff Detail
- Repository
- rL LLVM
Event Timeline
Removed all the unnecessary (or unwanted) options from the function attributes in the unit tests.
Unless I am mistaken, the ctpop and ctlz pieces of this code are independent of one another. Please split the patch into two patches, one for each transform.
lib/Transforms/InstCombine/InstCombineAddSub.cpp | ||
---|---|---|
1181–1183 | Seeing as how this is a rather rare occurrence for adds, can you sort this to the bottom? The Simplify... naming convention implies that no new instruction needs to be created which is not the case for this transform. I'd call it optimizeCtpopIdiom or something similar. | |
test/Transforms/InstCombine/ctpop-match.ll | ||
57 | Please CHECK a little more thoroughly, it would be nice to see the call to ctpop consume the argument, etc. | |
95–97 | This test case doesn't look reduced. |
They are not completely independent. The pattern for ctlz relies on having the call to Intrinsic::ctpop already in the code.
Addresses all the comments except for splitting the patch into two. The ctpop part does not depend on ctlz, but ctlz utilizes ctpop.
Are you ok with keeping both parts together, or would you still prefer to have them separated?
From previous comments it looks like this patch should just contain the ctpop patterns and a followup patch adds the ctlz work.
An additional followup could be to recognise the pattern cttz(x) = ctpop((x & -x)- 1)
Additionally, is there any chance that this could be tweaked to detect the same ctpop / lzcnt patterns for vector types as well?
Yes, and I asked if it's still ok for these two parts to go together. I can split it if there is a strong preference for that.
I haven't thought about the vector versions and I haven't seen any code that does this for vectors, so I'm not sure what patterns to look for. I don't know if the pattern for scalars would also apply to vectors.
I can add cttz.
lib/Transforms/InstCombine/InstCombineAddSub.cpp | ||
---|---|---|
1060 | I think it would be better if the comment above matchCtpopW explained what matchCtpopW does, and a comment above optimizeToCtpop explained what optimizeToCtpop does. The: // ... If BW is less than the bit-width of the type of V, then // BI == ctpop(V) if the bits of V beyond BW are zero. is a bit confusing here because it is describing behavior implemented in a different function. | |
1158 | Instead of checking for known zero bits, why not transform this into a ctpop(v & mask)? If the upper bits are known to be zero, then the mask will go away in the next instcombine iteration regardless. |
Does the rest of the optimization pipeline benefit from this canonicalization? If not, I would recommend moving this to CGP.
To answer David's question: the main reason why it's here is that after the instruction combiner, the code is transformed in a way that makes the detection of the popcnt pattern more difficult. At this stage, there is some regularity in the IR that can be exploited to check for popcnt patterns of varying widths.
As far as downstream optimizations are involved---this transformation will reduce code size by quite a bit, and any optimization that can be limited by code size could potentially be enabled by this change.
lib/Transforms/InstCombine/InstCombineAddSub.cpp | ||
---|---|---|
1158 | Because then the pattern is not really computing the population count. The instructions in the pattern operate on the whole value, including the high bits, and these bits contribute to the final result. |
I am concerned that you will slow down InstCombine by a considerable margin by checking for PopCnt. Most of the time, the work done to perform this optimization will not be fruitful.
We've recently been through a similar odyssey for bswap/bitreverse (see r257875). We managed to move that logic to CGP without too much drama.
I think it would be better if the comment above matchCtpopW explained what matchCtpopW does, and a comment above optimizeToCtpop explained what optimizeToCtpop does. The:
is a bit confusing here because it is describing behavior implemented in a different function.