This is an archive of the discontinued LLVM Phabricator instance.

[InstCombine] Recognize idioms for ctpop and ctlz
Needs ReviewPublic

Authored by kparzysz on Apr 2 2018, 12:03 PM.

Details

Summary

Look for code that does the population count and count leading zeros that corresponds to the typical bit-twiddling algorithms from Hacker's Delight.

Diff Detail

Repository
rL LLVM

Event Timeline

kparzysz created this revision.Apr 2 2018, 12:03 PM
kparzysz updated this revision to Diff 140663.Apr 2 2018, 12:17 PM

Added comments.

Does this trigger on the compiler-rt implementation of popcountsi2/clzsi2? I guess it technically isn't a problem if it does, given the LLVM backend currently doesn't generate calls to them, but it might be worth adding a backend testcase to make sure we don't generate an infinite loop.

lib/Transforms/AggressiveInstCombine/BitCountCombine.cpp
157

You might want to match patterns which don't include the subtraction; it could easily get combined away before you get here (if someone computes "ctlz(n)-1", or "cltz(x)-ctlz(y)", etc.).

Does this trigger on the compiler-rt implementation of popcountsi2/clzsi2? I guess it technically isn't a problem if it does, given the LLVM backend currently doesn't generate calls to them, but it might be worth adding a backend testcase to make sure we don't generate an infinite loop.

Would it make sense to check the function name for these two above and exit early if it matches?

kparzysz updated this revision to Diff 140693.Apr 2 2018, 2:35 PM

Removed the explicit check for subtraction when checking ctlz pattern.

Added a check to avoid optimizing compiler-rt functions.

kparzysz marked an inline comment as done.Apr 2 2018, 2:35 PM
craig.topper added inline comments.Apr 2 2018, 9:36 PM
lib/Transforms/AggressiveInstCombine/BitCountCombine.cpp
62

m_APInt doesn't match a specific APInt. It just matches any ConstantInt or splat and returns a pointer to the APInt it found. The pointer passed in would normally be unitialized in the caller.

99

Could this make use of APInt::getSplat?

107

Spell this out? 'SA' isn't meaningful without looking at the lambda being called.

kparzysz updated this revision to Diff 140770.Apr 3 2018, 6:33 AM

Addressed comments on the previous diff.

kparzysz marked 3 inline comments as done.Apr 3 2018, 6:34 AM
kparzysz added inline comments.
lib/Transforms/AggressiveInstCombine/BitCountCombine.cpp
62

Wow, this was bad. Thanks for catching this.

craig.topper added inline comments.Apr 3 2018, 1:20 PM
lib/Transforms/AggressiveInstCombine/BitCountCombine.cpp
53

Should take APInt M by const reference if possible to avoid a costly copy if its larger than 64 bits.

74

APInt::isSameValue is most useful when the APInts have different widths. Is that the case here? If not we should just use operator==

195

What if the ctpop returns a vector type?

Also use update_test_checks.py to generate the CHECK lines in the tests.

kparzysz marked 3 inline comments as done.Apr 3 2018, 2:04 PM
kparzysz added inline comments.
lib/Transforms/AggressiveInstCombine/BitCountCombine.cpp
74

Yes, this can legitimately happen.

kparzysz updated this revision to Diff 140862.Apr 3 2018, 2:05 PM

Addressed comments.

I think we need to evaluate what popcount sequences we want to handle. The code you're handling isn't the most optimal version

For example compiler-rt uses this

su_int x = (su_int)a;
x = x - ((x >> 1) & 0x55555555);
/* Every 2 bits holds the sum of every pair of bits */
x = ((x >> 2) & 0x33333333) + (x & 0x33333333);
/* Every 4 bits holds the sum of every 4-set of bits (3 significant bits) */
x = (x + (x >> 4)) & 0x0F0F0F0F;
/* Every 8 bits holds the sum of every 8-set of bits (4 significant bits) */
x = (x + (x >> 16));
/* The lower 16 bits hold two 8 bit sums (5 significant bits).*/
/*    Upper 16 bits are garbage */
return (x + (x >> 8)) & 0x0000003F;  /* (6 significant bits) */

Then there is another form here that uses a multiply in the last step.

https://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel

What do you propose? The ones from the webpage, plus what compiler-rt does?

I'm not sure what I propose. The code in compiler-rt is in hackers delight but the last set of shifts and adds are in a different order. Did you have a real world application or benchmark that was motivating this change?

I think it came from a customer application. I tried to put it in instcombine a long time back, but it was rejected on the grounds that it was too time-consuming. It's been sitting around in my local repo until I thought about committing it again, this time in the aggressive instcombine which didn't exist back then. Since I already have it, I didn't want to delete it.

Recognizing more patterns is doable, but it can take some time.

Even back then there was some interference with the other parts of instcombine, and this code had to be put in the right place to avoid it. The problem was that instcombine would "preprocess" this computation into a form that was no longer "recursively symmetric" and the code as is, would no longer work.

I don't know what the long-term strategy is for dealing with the interactions between pattern-recognition code and instcombine. This is a reoccurring issue for the polynomial multiplication code in HexagonLoopIdiomRecognition and it likely to affect any code of that nature.

spatel added a comment.Apr 4 2018, 7:33 AM

I don't know what the long-term strategy is for dealing with the interactions between pattern-recognition code and instcombine. This is a reoccurring issue for the polynomial multiplication code in HexagonLoopIdiomRecognition and it likely to affect any code of that nature.

I don't know if there's an actual strategy. There's no formal definition of 'canonical IR' AFAIK, so we continue to simplify code via peepholes in instcombine. Anything downstream of that has to adjust to those changes. I've dealt with that many times as an interaction between instcombine and DAG combine.

Let's look at pop32 as an example. We have 3 non-loop variations to consider so far IIUC: (a) all add ops (Hacker's Delight), (b) replace first add+mask with sub, and (c) replace ending mask+shift+add with multiply.

As IR, these are:

define i32 @pop32_all_adds(i32 %x) {
  %v0 = and i32 %x, 1431655765
  %v1 = lshr i32 %x, 1
  %v2 = and i32 %v1, 1431655765
  %v3 = add nuw i32 %v0, %v2
  %v4 = and i32 %v3, 858993459
  %v5 = lshr i32 %v3, 2
  %v6 = and i32 %v5, 858993459
  %v7 = add nuw nsw i32 %v4, %v6
  %v8 = and i32 %v7, 117901063
  %v9 = lshr i32 %v7, 4
  %v10 = and i32 %v9, 117901063
  %v11 = add nuw nsw i32 %v8, %v10
  %v12 = and i32 %v11, 983055
  %v13 = lshr i32 %v11, 8
  %v14 = and i32 %v13, 983055
  %v15 = add nuw nsw i32 %v12, %v14
  %v16 = and i32 %v15, 31
  %v17 = lshr i32 %v15, 16
  %v18 = add nuw nsw i32 %v16, %v17
  ret i32 %v18
}

define i32 @pop32_sub(i32 %x) {
  %shr = lshr i32 %x, 1
  %and = and i32 %shr, 1431655765
  %sub = sub i32 %x, %and
  %shr1 = lshr i32 %sub, 2
  %and2 = and i32 %shr1, 858993459
  %and3 = and i32 %sub, 858993459
  %add = add nuw nsw i32 %and2, %and3
  %shr4 = lshr i32 %add, 4
  %add5 = add nuw nsw i32 %shr4, %add
  %and6 = and i32 %add5, 252645135
  %shr7 = lshr i32 %and6, 16
  %add8 = add nuw nsw i32 %shr7, %and6
  %shr9 = lshr i32 %add8, 8
  %add10 = add nuw nsw i32 %shr9, %add8
  %and11 = and i32 %add10, 63
  ret i32 %and11
}

define i32 @pop32_mul(i32 %x) {
  %shr = lshr i32 %x, 1
  %and = and i32 %shr, 1431655765
  %sub = sub i32 %x, %and
  %and1 = and i32 %sub, 858993459
  %shr2 = lshr i32 %sub, 2
  %and3 = and i32 %shr2, 858993459
  %add = add nuw nsw i32 %and3, %and1
  %shr4 = lshr i32 %add, 4
  %add5 = add nuw nsw i32 %shr4, %add
  %and6 = and i32 %add5, 252645135
  %mul = mul i32 %and6, 16843009
  %shr7 = lshr i32 %mul, 24
  ret i32 %shr7
}

First, do we have consensus on which of these is canonical? Generally, we prefer the form with less instructions (pop32_mul), but is the instruction count reduction justified by using a mul?
Second, can we add instcombines that would reduce one or more of these to another form? If so, let's add those. If not, then this pass needs to match all of those forms to be effective (but that doesn't have to happen in one patch of course).

I don't know if there's an actual strategy. There's no formal definition of 'canonical IR' AFAIK, so we continue to simplify code via peepholes in instcombine. Anything downstream of that has to adjust to those changes. I've dealt with that many times as an interaction between instcombine and DAG combine.

This isn't sustainable in the long run. Recognizing complex computations and replacing them with short equivalents (such as intrinsics that targets may provide efficient implementations of) is arguably better than only doing peephole optimizations, and yet the current model makes it really difficult to write such code.

spatel added a comment.Apr 4 2018, 1:15 PM

I don't know if there's an actual strategy. There's no formal definition of 'canonical IR' AFAIK, so we continue to simplify code via peepholes in instcombine. Anything downstream of that has to adjust to those changes. I've dealt with that many times as an interaction between instcombine and DAG combine.

This isn't sustainable in the long run. Recognizing complex computations and replacing them with short equivalents (such as intrinsics that targets may provide efficient implementations of) is arguably better than only doing peephole optimizations, and yet the current model makes it really difficult to write such code.

There was some discussion about optimal graph rewriting, but I don't know if there's any work/progress on that yet.

Until then, I think we're seeing the alternatives of the current model in this patch: either we add code to instcombine and coordinate this pass with instcombine's preferred form, or we increase the pattern matching complexity here...or we acknowledge that it's impossible to match all the variants, and let it slide.

FWIW, here are reductions of the patterns that we could transform in instcombine, but I suspect we don't want to add such narrow transforms there. It's probably better to keep the specialized pattern matching cost and complexity here:

; https://rise4fun.com/Alive/0ej

Name: 2_bit_sum
  %v0 = and i32 %x, 1431655765 ; 0x55555555
  %v1 = lshr i32 %x, 1
  %v2 = and i32 %v1, 1431655765
  %v3 = add i32 %v0, %v2
=>
  %v1 = lshr i32 %x, 1
  %v2 = and i32 %v1, 1431655765
  %v3 = sub i32 %x, %v2

; https://rise4fun.com/Alive/ly5

Name: shift_add_to_mul
  %s1 = and i32 %x, 252645135 ; 0x0f0f0f0f
  %s2 = lshr i32 %s1, 16
  %s3 = add i32 %s1, %s2
  %s4 = lshr i32 %s3, 8
  %s5 = add i32 %s3, %s4
  %r = and i32 %s5, 63 ; 0x3f
=>
  %s1 = and i32 %x, 252645135 ; 0x0f0f0f0f
  %m1 = mul i32 %s1, 16843009 ; 0x01010101
  %r = lshr i32 %m1, 24

This isn't sustainable in the long run. Recognizing complex computations and replacing them with short equivalents (such as intrinsics that targets may provide efficient implementations of) is arguably better than only doing peephole optimizations, and yet the current model makes it really difficult to write such code.

Recognizing every pattern which is equivalent to popcount is basically impossible. (I mean, you could theoretically integrate something like Alive into LLVM to recognize any straight-line code which is equivalent to popcount, but that's probably not practical.) So you have to come up with some specific set of patterns to recognize, based on what the source code looks like and what transforms run before your pattern-recognition code.

As you've noted, running "instcombine" before your pattern-recognition code is going to be a constant source of trouble, because it isn't a fixed set of transforms, and it runs very early in the pass pipeline. So either you run your pattern-recognition before instcombine, or you add some tests to the LLVM regression tests which will break if a new transform breaks your pattern-recognition, and hope for the best. I guess it might help a bit if we came up with some restricted criteria for allowed transforms in instcombine, and put the rest into aggressive-instcombine. But we're never going to completely freeze the optimization pipeline, so there's fundamentally some maintenance burden.

As you've noted, running "instcombine" before your pattern-recognition code is going to be a constant source of trouble, because it isn't a fixed set of transforms, and it runs very early in the pass pipeline. So either you run your pattern-recognition before instcombine, or you add some tests to the LLVM regression tests which will break if a new transform breaks your pattern-recognition, and hope for the best. I guess it might help a bit if we came up with some restricted criteria for allowed transforms in instcombine, and put the rest into aggressive-instcombine. But we're never going to completely freeze the optimization pipeline, so there's fundamentally some maintenance burden.

The problem right now is that with both, instcombine and complex pattern matching, ongoing development of instcombine will incur very high maintenance costs on the pattern matching code. This is what I'd like to avoid. I'd like to have both optimizations, but without the interference. Running pattern recognition first sounds like the least invasive change and would hopefully keep everybody happy.

kparzysz updated this revision to Diff 141367.Apr 6 2018, 9:33 AM

Moving aggressive instcombine pass to before the regular instcombine, just to demonstrate the idea.

Would a change like the one in PassManagerBuilder.cpp be acceptable? There may be some work to make sure that the pre-existing components of the aggressive instcombine still apply, but I'm wondering if this direction is something we could agree on.

Moving aggressive instcombine pass to before the regular instcombine, just to demonstrate the idea.

I'm curious to know, while this does sidestep the issue of having to adapt to the instcombine/earlier passes being smarter, how does this not either
a) significantly increase the amount of matching this pass needs to do (due to being in the earlier position in the pipe), or
b) reduce the number of patterns this pass will handle?

Unless of course the only desire is to match exactly those patterns that are being tested, and exactly those patterns only...

Moving aggressive instcombine pass to before the regular instcombine, just to demonstrate the idea.

I'm curious to know, while this does sidestep the issue of having to adapt to the instcombine/earlier passes being smarter, how does this not either [...]

The patterns that we're trying to match correspond to some "typical implementations", so it's not intended to detect any possible computation that, for example, calculates population count. Different popcount algorithms would obviously need different matching codes. That's not a loss, since instcombine cannot now and likely will never be able to reorganize entire algorithms into a common way. The big problem with instcombine now is that its ongoing development results in a continuous stream of small changes to the output which then require ongoing adjustments to the matching code. Look at HexagonLoopIdiomRecognition.cpp to see the extent to which it goes to try and see through what instcombine has done.

Moving aggressive instcombine pass to before the regular instcombine, just to demonstrate the idea.

I'm curious to know, while this does sidestep the issue of having to adapt to the instcombine/earlier passes being smarter, how does this not either [...]

The patterns that we're trying to match correspond to some "typical implementations", so it's not intended to detect any possible computation that, for example, calculates population count.

Ah, i see. Disappointing. That design decision should really be specified in the documentation.

To point the obvious, if one takes one of these tests, runs those through instcombine
and then runs aggressiveinstcombine, chances are they will no longer be matched...
(Well, the main point being, any non-expected, slightly different pattern, that i would normally expected to be slightly canonicalized beforehand),

Also, while that may be a net positive tradeoff for this particular combine (BitCountCombine),
is that also true for all other possible future combines in this pass?

Different popcount algorithms would obviously need different matching codes. That's not a loss, since instcombine cannot now and likely will never be able to reorganize entire algorithms into a common way. The big problem with instcombine now is that its ongoing development results in a continuous stream of small changes to the output which then require ongoing adjustments to the matching code. Look at HexagonLoopIdiomRecognition.cpp to see the extent to which it goes to try and see through what instcombine has done.

To add to that, i think you should be able to get rid of at least some of the pain of detection
of when instcombine got smarter and these folds no longer match, by adding kind-of end-to-end optimization tests.

I.e. i don't see why you could not add a second run-line so it is something like:

; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
; RUN: opt -aggressive-instcombine -S < %s | FileCheck %s --check-prefixes=PLAIN
; RUN: opt -instcombine -aggressive-instcombine -S < %s | FileCheck %s --check-prefixes=POSTINSTCOMBINE

Caveats:

  • I'm not sure there is much (any?) precedent for such end-to-end testing in LLVM
  • I'm not 100% sure update_test_checks.py already supports more than one run-line
  • Whoever adds a instcombine fold that breaks aggressive-instcombine fold, will have to fix aggressive-instcombine fold. But that is a good thing i suppose :)

To add to that, i think you should be able to get rid of at least some of the pain of detection
of when instcombine got smarter and these folds no longer match, by adding kind-of end-to-end optimization tests.

Detection is not a problem.

To point the obvious, if one takes one of these tests, runs those through instcombine
and then runs aggressiveinstcombine, chances are they will no longer be matched...
(Well, the main point being, any non-expected, slightly different pattern, that i would normally expected to be slightly canonicalized beforehand),

I'm still not sure what your argument is. The idea here is that the aggressive instcombine (or generally, complex pattern matching code) will always run before instcombine. Yes, the fact that instcombine will change something in the code being matched is exactly what this idea is trying to avoid. Instcombine is a continuously evolving pass that keeps changing the IR. Any matching code running after it needs to be continuously updated to reflect those changes. This is a big problem for any non-trivial pattern matching. Running such pattern matching before instcombine does not eliminate the need for updates (if the preceding passes change), but significantly reduced the frequency of such updates.