This is an archive of the discontinued LLVM Phabricator instance.

[InstCombine] foldXorOfICmps(): don't give up on non-single-use ICmp's if all users are freely invertible
ClosedPublic

Authored by lebedev.ri on Jul 31 2019, 12:36 PM.

Details

Summary

This is rather unconventional..

As the comment there says, we don't have much folds for xor-of-icmps,
we try to turn them into an and-of-icmps, for which we have plenty of folds.
But if the ICmp we need to invert is not single-use - we give up.

As discussed in https://reviews.llvm.org/D65148#1603922,
we may have a non-canonical CLAMP pattern, with bit match and
select-of-threshold that we'll potentially clamp.
As it can be seen in canonicalize-clamp-with-select-of-constant-threshold-pattern.ll,
out of all 8 variations of the pattern, only two are not canonicalized into
the variant with and+icmp instead of bit math.
The reason is because the ICmp we need to invert is not single-use - we give up.

We indeed can't perform this fold at will, the general rule is that
we should not increase instruction count in InstCombine,

But we wouldn't end up increasing instruction count if we can adapt every other
user to the inverted value. This way the not we create will get folded,
and in the end the instruction count did not increase.

For that, of course, we need to look at the users of a Value,
which is again rather unconventional for InstCombine :S

Thus i'm proposing to be a little bit more insistive in foldXorOfICmps().
The alternatives would be to not create that not, but add duplicate code to
manually invert all users; or to add some even less general combine to handle
some more specific pattern[s].

Diff Detail

Event Timeline

lebedev.ri created this revision.Jul 31 2019, 12:36 PM

Rebased, NFC.
Ping.

Sorry for the delay in looking at this. What do the motivating examples look like for codegen? Are we getting the optimal codegen for these clamps, or would we better off trying to create min/max and/or saturating intrinsics?

Make sure I didn't typo this translation, but I think I'm seeing extra instructions for vectors with this transform on x86 and aarch64:

define <4 x i32> @t4_select_cond_xor_v0_vec_before(<4 x i32> %X) {
  %need_to_clamp_positive = icmp sgt <4 x i32> %X, <i32 32767, i32 32767, i32 32767, i32 32767>
  %dont_need_to_clamp_negative = icmp sgt <4 x i32> %X, <i32 -32768, i32 -32768, i32 -32768, i32 -32768>
  %clamp_limit = select <4 x i1> %need_to_clamp_positive, <4 x i32> <i32 32767, i32 32767, i32 32767, i32 32767>, <4 x i32> <i32 -32768, i32 -32768, i32 -32768, i32 -32768>
  %dont_need_to_clamp = xor <4 x i1> %need_to_clamp_positive, %dont_need_to_clamp_negative
  %R = select <4 x i1> %dont_need_to_clamp, <4 x i32> %X, <4 x i32> %clamp_limit
  ret <4 x i32> %R
}

define <4 x i32> @t4_select_cond_xor_v0_vec_after(<4 x i32> %X) {
  %t1 = icmp slt <4 x i32> %X, <i32 32768, i32 32768, i32 32768, i32 32768>
  %t2 = select <4 x i1> %t1, <4 x i32> <i32 -32768, i32 -32768, i32 -32768, i32 -32768>, <4 x i32> <i32 32767, i32 32767, i32 32767, i32 32767>
  %t3 = add <4 x i32> %X, <i32 32767, i32 32767, i32 32767, i32 32767>
  %t4 = icmp ult <4 x i32> %t3, <i32 65535, i32 65535, i32 65535, i32 65535>
  %t5 = select <4 x i1> %t4, <4 x i32> %X, <4 x i32> %t2
  ret <4 x i32> %t5
}

Sorry for the delay in looking at this. What do the motivating examples look like for codegen? Are we getting the optimal codegen for these clamps, or would we better off trying to create min/max and/or saturating intrinsics?

We most certainly do *NOT* get optimal codegen with these patterns, as noted in https://reviews.llvm.org/D65148#1603922 by @dmgreen.
We need to fold them into traditional clamp pattern, D65765 does just that.

The motivation behind *this* patch is that out of all the variations of this pattern that we can have (see canonicalize-clamp-with-select-of-constant-threshold-pattern.ll),
only this single pattern is not folded into the same form all the other variants are.
So if we don't do this, i will need to redo D65765 to do also handle this non-canonical pattern, and i don't think it would be the correct solution.

...

lebedev.ri added a comment.EditedAug 12 2019, 9:17 AM

Make sure I didn't typo this translation, but I think I'm seeing extra instructions for vectors with this transform on x86 and aarch64:
...

Sure, but that is irrelevant. D65765 improves upon both of them:
https://godbolt.org/z/-UIddL

Make sure I didn't typo this translation, but I think I'm seeing extra instructions for vectors with this transform on x86 and aarch64:
...

Sure, but that is irrelevant. D65765 improves upon both of them:
https://godbolt.org/z/-UIddL

Ok, but I wouldn't call it irrelevant if this patch is viewed/committed independently. For example, if D65765 had to be reverted, we'd probably want to revert this too, so there's no perf regression potential.

llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
2654–2658

'V can be' -> 'V be'

2654–2659

Is there a way to share logic with the existing IsFreeToInvert()? Can we put them next to each other in 1 file, so it's easier to discover that both exist?

2725

it's -> its

2732

inversible -> invertible

lebedev.ri marked 5 inline comments as done.

Addressed review notes, thanks for taking a look!

Make sure I didn't typo this translation, but I think I'm seeing extra instructions for vectors with this transform on x86 and aarch64:
...

Sure, but that is irrelevant. D65765 improves upon both of them:
https://godbolt.org/z/-UIddL

Ok, but I wouldn't call it irrelevant if this patch is viewed/committed independently. For example, if D65765 had to be reverted, we'd probably want to revert this too, so there's no perf regression potential.

Right.
What i meant is, every other variant of the pattern in canonicalize-clamp-with-select-of-constant-threshold-pattern.ll
is already 'folded'/bad, so so this only handles an edge-case, not generally worsens things.

llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
2654–2659

Right, i should have put it next to IsFreeToInvert().

spatel accepted this revision.Aug 13 2019, 5:16 AM

LGTM

llvm/lib/Transforms/InstCombine/InstCombineInternal.h
176

Use the current formatting: canFreely...
It should be safe to update IsFreeToInvert() too for uniformity.

This revision is now accepted and ready to land.Aug 13 2019, 5:16 AM

LGTM

Thank you for the review!