Page MenuHomePhabricator

[InstCombine] Remove decanonicalizing transforms of selects
AbandonedPublic

Authored by mkazantsev on Apr 19 2018, 10:15 PM.

Details

Summary

Currently InstCombine tends to recognize patterns like

%cmp = icmp slt iA %x, 0
%select = select i1 %cmp, iB %A, iB %B

as bit-test against highest bit, and convert it into one of multiple shift-based patterns
which may look differently depending on types iA and iB and values of %A and %B.
It only happens if %A and %B are constants. We observe at least three possible sutiatons:

Case 1:

%c2 = icmp slt %a, 0
%select1 = select %c2, -1, %Greater

converts into

(%a s>> 31) | %Greater

Case 2:

%c2 = icmp slt i32 %a, 0
%select1 = select i1 %c2, i32 %Less, i32 %Greater

converts into

(%a s>> 31) - (%Less - %Greater) + %Greater

or, if it is the equivalent, to

(%a s>> 31) - (%Less - %Greater) | %Greater

Case 3:

%c2 = icmp slt i64 %a, 0
%select1 = select i1 %c2, i32 %Less, i32 %Greater

converts to

(trunc (%a s>> 63) to i32) - (%Less - %Greater) + %Greater

There are possibly more variatons of this. The general idea is that select by comparison with 0 or -1
gets transformed into multiple different patterns, and very small changes in conditions and constants
may change this pattern. There is no clear benefits of that transforms, but the obvious downside of them
is decanonicalization. For example, we have a matcher of three-ways comparison that expects to see the
pattern of the following form:

%c1 = icmp eq %a, 0
%c2 = icmp slt %a, 0
%select1 = select %c2, %Less, %Greater
%select2 = select %c1, %Equal, %select2

This pattern is widely seen across the code: it is a canonical representation of three-ways comparators
that return -1, 0 or 1 if %a is less, equal or greater than zero. In the result of transforms described in
cases 1, 2, 3, the pair of %c2 and %select1 gets transformed into something different (with shifts, casts
and different bit operations). So the three-way comparison pattern cannot be recognized anymore.

This patch removes the transforms that make us replace select by comparison against zero with all this bit
magic. We want to prefer selects over random bit magic as a more canonical form of doing that: select
has more transparent semantics and other parts of InstCombine make use it rather than try to recognize
that some sequence of non-trivial bit instructions actually represents a select by comparison against zero.

It is codegen's work to turn selects into bit operations in the end of pipeline to calculate it efficiently if
shifts are faster than cmoves, but doing so too early has no benifits and has obvious pessimization of other
pattern matchers within instcombine.

In this patch we only stop harmful transforms from happening. The code that turns all this bit magic back
to selects will go in a separate patch.

Diff Detail

Event Timeline

mkazantsev created this revision.Apr 19 2018, 10:15 PM
mkazantsev edited the summary of this revision. (Show Details)
lebedev.ri added inline comments.Apr 19 2018, 11:19 PM
test/Transforms/InstCombine/select-of-bittest.ll
662 ↗(On Diff #143226)

Please use utils/update_test_checks.py.

mkazantsev marked an inline comment as done.

Formatted tests.

lebedev.ri added inline comments.Apr 20 2018, 2:58 AM
test/Transforms/InstCombine/select-of-bittest.ll
660 ↗(On Diff #143266)

For each of those, could you please add vector tests.
Something like:

define i32 @compare_to_zero_mismatched_types_idiomatic_trunc(i64 %x) {
  %cmp = icmp slt i64 %x, 0
  %select1 = select i1 %cmp, i32 -1, i32 1
  ret i32 %select1
}

define <2 x i32> @compare_to_zero_mismatched_types_idiomatic_trunc_splatvec(<2 x i64> %x) {
  %cmp = icmp slt <2 x i64> %x, <i64 0, i64 0>
  %select1 = select <2 x i1> %cmp, <i32 -1, i32 -1>, <i32 1, i32 1>
  ret <2 x i32> %select1
}

define <3 x i32> @compare_to_zero_mismatched_types_idiomatic_trunc_splatvec_undef(<3 x i64> %x) {
  %cmp = icmp slt <3 x i64> %x, <i64 0, i64 undef, i64 0>
  %select1 = select <3 x i1> %cmp, <i32 -1, i64 undef, i32 -1>, <i32 1, i64 undef, i32 1>
  ret <3 x i32> %select1
}

; May be more than one
define <3 x i32> @compare_to_zero_mismatched_types_idiomatic_trunc_nonsplatvec(<3 x i64> %x) {
  %cmp = icmp slt <3 x i64> %x, <i64 0, i64 undef, i64 0>
  %select1 = select <3 x i1> %cmp, <i32 -1, i64 0, i32 1>, <i32 1, i64 0, i32 -1>
  ret <3 x i32> %select1
}

I don't think we should be doing any of these select-of-constant transforms in instcombine.

It's worse for code analysis, more IR instructions, and may be detrimental to perf. Think about the cases where a conditional move executes at the same speed as a simple add (Ryzen?) or we have profile data for the compare, so branch prediction is perfect.

There's lots of code that does this kind of thing in the DAG, and that's where I think it belongs (using target hooks as needed). There was some discussion about this on llvm-dev here:
https://groups.google.com/forum/#!topic/llvm-dev/pid_thv2X-A

So I think we should be removing some of these transforms from instcombine rather than adding to them.

Other than that inline comment, i think it would be nice to commit
the baseline tests (as of trunk), so the effect of this proposal could be observed.
Even if the code changes won't land, this would at least document the current behavior.

test/Transforms/InstCombine/select-of-bittest.ll
660 ↗(On Diff #143266)

Oh, also, this is a really bad place to add these tests.
They should either go to a new file, or at least before

; ============================================================================ ;
; Negative tests. Should not be folded.
; ============================================================================ ;

I don't think we should be doing any of these select-of-constant transforms in instcombine.

It's worse for code analysis, more IR instructions, and may be detrimental to perf. Think about the cases where a conditional move executes at the same speed as a simple add (Ryzen?) or we have profile data for the compare, so branch prediction is perfect.

There's lots of code that does this kind of thing in the DAG, and that's where I think it belongs (using target hooks as needed). There was some discussion about this on llvm-dev here:
https://groups.google.com/forum/#!topic/llvm-dev/pid_thv2X-A

So I think we should be removing some of these transforms from instcombine rather than adding to them.

There is a second side to this though.

Even if all such "performance-degrading" transforms are removed from instcombine
(yes, i think the pass is rather huge and monolithic, this is a problem), it won't solve the problem.
The same 'bad' patterns surely could be produced via some other way.

I don't see how this is any different than lowering intrinsics.
Of course, they are no longer intrinsics, but IR, so optimizations may
break their canonical form, and backend will need to be adjusted
to recognize the IR patters (and potentially recognize the patterns
that did not originate form intrinsics, which is great).

So i'd say this really is backend's (DAGCombine?) problem.
This is just a question of whether or not there is interest to have a high quality codegen,
regardless of the input. It should recognize the 'bad' patterns, and if profitable,
transform to improve 'performance'. In this case, back to select.

And yes, absolutely, this is more complicated than just "stop dealing with it in instcombine" :)

I don't think we should be doing any of these select-of-constant transforms in instcombine.

It's worse for code analysis, more IR instructions, and may be detrimental to perf. Think about the cases where a conditional move executes at the same speed as a simple add (Ryzen?) or we have profile data for the compare, so branch prediction is perfect.

There's lots of code that does this kind of thing in the DAG, and that's where I think it belongs (using target hooks as needed). There was some discussion about this on llvm-dev here:
https://groups.google.com/forum/#!topic/llvm-dev/pid_thv2X-A

So I think we should be removing some of these transforms from instcombine rather than adding to them.

There is a second side to this though.

Even if all such "performance-degrading" transforms are removed from instcombine
(yes, i think the pass is rather huge and monolithic, this is a problem), it won't solve the problem.
The same 'bad' patterns surely could be produced via some other way.

I don't see how this is any different than lowering intrinsics.
Of course, they are no longer intrinsics, but IR, so optimizations may
break their canonical form, and backend will need to be adjusted
to recognize the IR patters (and potentially recognize the patterns
that did not originate form intrinsics, which is great).

So i'd say this really is backend's (DAGCombine?) problem.
This is just a question of whether or not there is interest to have a high quality codegen,
regardless of the input. It should recognize the 'bad' patterns, and if profitable,
transform to improve 'performance'. In this case, back to select.

And yes, absolutely, this is more complicated than just "stop dealing with it in instcombine" :)

I agree. Let me clarify the suggestion: don't just remove these control-flow to data-flow transforms from instcombine. Canonicalize to the cmp+select form. And then let the DAG convert it back if it's profitable. I did some fraction (what I thought were the most common parts) of this job already following the cited llvm-dev discussion. Here's the direct link to that reply:
http://lists.llvm.org/pipermail/llvm-dev/2016-September/105373.html

mkazantsev added a reviewer: bkramer.EditedApr 22 2018, 6:49 PM

Hi @spatel @lebedev.ri,

Actually I am absolutely agreed that these transforms produce really bad code which looks like a canonicalization breach. Let me give you a bit more context about my motivation for this change. I basically need that the following pattern was recognized correctly as three-way comparison:

%cmp1 = icmp eq %x, 0
%cmp2 = icmp slt %x, 0
%s1 = select i1 %cmp2, -1, 1
%s2 = select i1 %cmp1, 0, %s1

It is a typical pattern that we see in all comparators. We have a function matchThreeWayIntCompare that used to detect such a pattern. However at some point InstCombine started recognizing the pattern of %cmp2, %s1 as a bit test of the highest bit, removed it and inserted the shift-based bit test instead. Even worse, it does insert *different* patterns for case when both %x and %s1 are i32 or when they are of different types.

The problem is wide-spread across InstCombine: we seem to have *many* transforms of this kind spread across the code. And which of them will be applied is always a surprise: some of them are implemented for general case and some are limited. As result, we might end up with a dozen of different bit-test patterns that all represent a three-ways comparison.

This patch was an attempt to canonicalize all these transforms, so that at least for the same pattern they were done in the same manner regardless of types. Here I was taking an assumption like "okay, we are doing all this bit-test stuff anyways, let's at least do it uniformly". Because now what we have is a counter-canonicalization of this pattern.

I agree that in ideal world we should just remove all this job connected to adding bit-tests instead of this select, or even do it in opposite way (so that bit-tests are converted to select). I see two reasons why it is impossible.

  1. We have no idea how it impacts performance in the particular cases for which they were added.
  2. We have little understanding of how other transforms and optimizations will react on that, and will we break something or not. I have a crawling suspicion that all this stuff was done for vectorization, and that the vectorizer might expect this bit stuff rather than selects. I need someone familiar with vectorizer's code to confirm or refute that.

I will be totally happy if we decide to not transform selects into bittests. But I am hesitant to make a change this big in the part of code I am unfamiliar with, as well as I am unfamiliar with the reasons why it was done at all and whether it helps the vectorization. So I need the confirmation from vectorizer people that selects there are OK.

For now, my point is the following: yes, we are *already* doing all this bit magic. Yes, it looks like miscanonicalization, and we can discuss getting rid of all such transforms at all. But since we are already doing it and haven't yet decided to clear it off, let's at least do it well. Once we decide that all this stuff is counter-canonical, we can happily remove all this. What do you think of that?

Thanks,
Max

P.S. I've also added Benjamin Kramer who was the original author of this transform to take a look.

You can also have more context of why it was done from the commit messages of: D45854 D45855 D45856 D45863.

For now, my point is the following: yes, we are *already* doing all this bit magic. Yes, it looks like miscanonicalization, and we can discuss getting rid of all such transforms at all. But since we are already doing it and haven't yet decided to clear it off, let's at least do it well. Once we decide that all this stuff is counter-canonical, we can happily remove all this. What do you think of that?

I understand this as a short-term goal and local improvement, but I think if we don't reverse the direction now, we'll never correct it. For example, it took over 5 years to correct:
rL159230
with
rL319964

See also the mentioned commits where we may extend the DAG transforms and possibly solve the patterns that you are looking at:
rL296977
rL311731

You mention the vectorizers as a possible source for wanting to do the bit magic, but I think that most targets would actually do better vectorizing a select at this point because vsel/bsl/blend type of instructions are usually part of a vector ISA. This came up in:
https://bugs.llvm.org/show_bug.cgi?id=6773#c9 (and I'm not locating the review thread, but Roman has proposed tests for the vector cases)

Can you provide an IR -> asm example for one/some of the problem cases that led you to this (or preferably, file bug reports for them)? If we can fix the cases that we know would regress with a reversal of this IR canonicalization, then we should push ahead with that effort and fix instcombine. We can't guarantee that there won't be regressions, but we should be able to fix problems as they are reported.

Also see D24480 and the list of proposals and commits mentioned there.

Can you provide an IR -> asm example for one/some of the problem cases that led you to this (or preferably, file bug reports for them)? If we can fix the cases that we know would regress with a reversal of this IR canonicalization, then we should push ahead with that effort and fix instcombine. We can't guarantee that there won't be regressions, but we should be able to fix problems as they are reported.

Sure, please take a look at https://bugs.llvm.org/show_bug.cgi?id=37147
You can also use tests from D45854 D45855 D45856 D45863 for reference: this is the IR we expect InstCombine to produce.

Just for note: I am not aware of any asm advantages or disadvantages of what we have now in practice. I am only speculating about it in assumption that this bit-test magic was done with some reason. I don't have any real evidence that it is somehow good or bad in asm, but what I know is that does pessimize three-way comparison recognition.

Just for note: I am not aware of any asm advantages or disadvantages of what we have now in practice. I am only speculating about it in assumption that this bit-test magic was done with some reason. I don't have any real evidence that it is somehow good or bad in asm, but what I know is that does pessimize three-way comparison recognition.

I forgot about this, but I think I asked about the same problem as PR37147 here:
http://lists.llvm.org/pipermail/llvm-dev/2017-July/114885.html

...and the conclusion was that we want selects (or at least nobody strongly opposed that). And so I went to the DAG and made the necessary changes to allow producing good asm for all targets for those patterns (ie, there's a hook to convert those back to bit math). So there should be little risk in removing/inverting the instcombines at this point.

I'm looking closer now at how to replace some of the instcombines now without regressing anything. I expect this is going to take multiple steps to unravel. For example, we even have a 'm_Signum' matcher that's looking for shifts.

mkazantsev retitled this revision from [InstCombine] Relax restriction in foldSelectInstWithICmp for sake of smaller code size to [InstCombine] Remove decanonicalizing transforms of selects.
mkazantsev edited the summary of this revision. (Show Details)

Let's try an alternative approach. In this patch we remove harmful transforms that replace selects with random pieces of bit magic.

Here we only stop harmful transforms. The patch that will canonicalize all this bit magic to selects will go separately.

Let's try an alternative approach. In this patch we remove harmful transforms that replace selects with random pieces of bit magic.

Here we only stop harmful transforms. The patch that will canonicalize all this bit magic to selects will go separately.

I agree with this direction (and I have a patch that would do something similar), but this is going too far. We should remove the transforms that are harmful, but without creating known regressions. This requires writing new, more specific folds as we remove the over-reaching logic.

As you can see from test15g and test15h, there are good folds buried in here. Those correspond to something like this:

Name: set_bit_if_masked_val_is_clear
Pre: isPowerOf2(C1) && ((C2 ^ C3) == C1) && (C2 u> C3)
 %t1 = and i8 %x, C1
 %t2 = icmp eq i8 %t1, 0
 %r = select i1 %t2, i8 C2, i8 C3
=>
 %r = xor i8 %t1, C2

Name: set_bit_if_masked_val_is_set
Pre: isPowerOf2(C1) && ((C2 ^ C3) == C1) && (C2 u> C3)
 %t1 = and i8 %x, C1
 %t2 = icmp ne i8 %t1, 0
 %r = select i1 %t2, i8 C2, i8 C3
=>
 %r = or i8 %t1, C3

Name: clear_bit_if_masked_val_is_clear
Pre: isPowerOf2(C1) && ((C2 ^ C3) == C1) && (C3 u> C2)
 %t1 = and i8 %x, C1
 %t2 = icmp eq i8 %t1, 0
 %r = select i1 %t2, i8 C2, i8 C3
=>
 %r = or i8 %t1, C2
 
Name: clear_bit_if_masked_val_is_set
Pre: isPowerOf2(C1) && ((C2 ^ C3) == C1) && (C3 u> C2)
 %t1 = and i8 %x, C1
 %t2 = icmp ne i8 %t1, 0
 %r = select i1 %t2, i8 C2, i8 C3
=>
 %r = xor i8 %t1, C3

https://rise4fun.com/Alive/XG7

To verify that we're doing this properly, we need to add more tests for patterns like this. The current set isn't providing very good coverage.

I am confused. Didn't we agree on that replacement of selects with bit-wise logic, if it's profitable, should be a part of backend's work, and here we want selects because they have a more clear semantics that can be used by other transforms, rather than teaching every single transform to recognize the bit magic as select? You are giving some examples where codegen definitely could do what you are showing.

I am confused. Didn't we agree on that replacement of selects with bit-wise logic, if it's profitable, should be a part of backend's work, and here we want selects because they have a more clear semantics that can be used by other transforms, rather than teaching every single transform to recognize the bit magic as select? You are giving some examples where codegen definitely could do what you are showing.

  1. I think it's clearly better for IR to eliminate the select if we can reduce the instruction count too. It can lead to more reduction here in instcombine as seen in test15g. So that's why I'm proposing to add optimizations for these patterns: https://rise4fun.com/Alive/XG7 in D46086 (note that test15g is not regressed there).
  2. Even if you disagree with having those transforms, I don't think we should remove it as the first step. We're already going to be at risk for revert due to perf regressions with D46086 even though those are all IR improvements based on our reasoning so far. I'm purposely trying to minimize the diffs in that step, so we have a better chance of pushing this through.
  1. I think it's clearly better for IR to eliminate the select if we can reduce the instruction count too.

Not if local improvement on one step will break another step which will break us even more improvement. You save one instruction by doing bit magic and kill the further transform that would eliminate 4 instructions. My point is that the canonicalization is generally a more useful thing in a long run than saving of one or two instructions now and pessimizing another more powerful transform later. Current strategy of saving one instruction by adding this bit magic produces horrible code on the tests I've added.

As for test15g, I will give some more thought to what can be done to it.

So that's why I'm proposing to add optimizations for these patterns: https://rise4fun.com/Alive/XG7 in D46086 (note that test15g is not regressed there).

Why in InstCombine? These transforms are very simple, DAG selector should be able to do what you want, but later. The only possible profit I see from having this done in InstCombine is that it may affect the behavior of inlining/unrolling that calculate cost of the code, but it's only matter of tuning the cost functions if the need shows up.

Current strategy of saving one instruction by adding this bit magic produces horrible code on the tests I've added.

Please commit these tests, so we can see the current output and compare any differences between this patch and the other patch.

mkazantsev added a comment.EditedApr 27 2018, 9:43 PM

Current strategy of saving one instruction by adding this bit magic produces horrible code on the tests I've added.

Please commit these tests, so we can see the current output and compare any differences between this patch and the other patch.

Here it is. Please compare the cases when we compare against zero with cases when we compare against any other value.
https://reviews.llvm.org/rL331100

Updated the patch to show the impact better. Test 15g possibly needs closer evaluation, however it seems to be the only case where the code has become worse. Might be a separate issue.

Current strategy of saving one instruction by adding this bit magic produces horrible code on the tests I've added.

Please commit these tests, so we can see the current output and compare any differences between this patch and the other patch.

Here it is. Please compare the cases when we compare against zero with cases when we compare against any other value.
https://reviews.llvm.org/rL331100

Thanks! I updated D46086. Unless I've missed it, the diffs for the 3-way-compare tests are identical?

Thanks @spatel , pretty much so!

mkazantsev abandoned this revision.May 2 2018, 6:44 PM

Abandoning in favour of D46086

spatel added inline comments.Nov 29 2018, 11:09 AM
test/Transforms/InstCombine/unrecognized_three-way-comparison.ll
11–13

Not sure exactly where the logic hole is that caused this, but I moved a fold to InstSimplify in:
rL347896
...and now we get this case and the similar 'compare_against_arbitrary_value_type_mismatch' test.