
[X86] Lowering addus/subus intrinsics to native IR (LLVM part)
ClosedPublic

Authored by tkrupa on Apr 27 2018, 1:35 AM.

Details

Summary

This revision improves on the previous version (rL330322), which was reverted due to crashes.

This is the patch that lowers x86 intrinsics to native IR
in order to enable optimizations. The patch also includes folding
of previously missing saturation patterns so that IR emits the same
machine instructions as the intrinsics.
Lowering in clang: https://reviews.llvm.org/D46892

Diff Detail

Repository
rL LLVM

Event Timeline


The test case no longer crashes after adding a check of the operands' type before extension and after truncation.
Currently detectAddSubSatPattern sometimes detects the pattern even when both operands are constant. In some cases it would be more efficient not to introduce an X86ISD::ADDS etc. node (such as in the test case chandlerc provided in https://bugs.llvm.org/show_bug.cgi?id=37260). Would it be better to never detect it when both args are constant?

Do we have test coverage for PR37260?

craig.topper added inline comments.Apr 30 2018, 10:56 AM
lib/Target/X86/X86ISelLowering.cpp
36147 ↗(On Diff #144295)

Move this comment into the if body so it doesn't interfere with the closing curly brace from the above 'if'

36153 ↗(On Diff #144295)

Can you use APInt::isSignedIntN/isIntN here?

36177 ↗(On Diff #144295)

What if the SIGN_EXTEND/ZERO_EXTEND was from an even smaller type? You can't just remove it. You need to narrow it. Always emitting an ISD::TRUNCATE may be sufficient. DAG combine should be able to simplify it.

36183 ↗(On Diff #144295)

Oh I see, you covered for that case here. But do we really need to do this? Can we just make smaller extends?

tkrupa marked 3 inline comments as done.May 2 2018, 3:59 AM
tkrupa added inline comments.
lib/Target/X86/X86ISelLowering.cpp
36183 ↗(On Diff #144295)

No - detectSSatPattern and detectUSatPattern check for the min/max values of the destination type (VT here), so we cannot change that.
However, if the element type before extension is smaller than that of VT, overflow/underflow can never occur. Can we just emit ISD::ADD/ISD::SUB in such cases?
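The observation above can be checked with a small Python model (hypothetical helper name; a sketch of the reasoning, not LLVM code): when both operands come from a type strictly narrower than the destination type, their sum can never leave the destination type's range, so a plain add is safe.

```python
def fits_after_widening_add(src_bits: int, dst_bits: int) -> bool:
    # Two unsigned src_bits values sum to at most 2 * (2**src_bits - 1),
    # which fits in dst_bits whenever src_bits < dst_bits.
    max_sum = 2 * (2 ** src_bits - 1)
    return max_sum <= 2 ** dst_bits - 1

# i8 operands added in an i16 destination can never saturate:
assert fits_after_widening_add(8, 16)
# Even one extra bit is enough for a single add:
assert fits_after_widening_add(8, 9)
# Same-width operands can overflow, so saturation is still needed:
assert not fits_after_widening_add(16, 16)
```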

Can you take another look at fixing the psubusb/psubusw fast-isel cases please?

What do we want to do about the concerns raised in https://reviews.llvm.org/D45721 by @sroland ?

tkrupa added a subscriber: DavidKreitzer.EditedMay 9 2018, 9:51 AM

We're currently discussing the possible solutions for the JIT pipeline issue with @DavidKreitzer.

> We're currently discussing the possible solutions for the JIT pipeline issue with @DavidKreitzer.

Thanks for taking a look. I mean, it was quite easy to adapt our code when the removed intrinsics had more or less obvious, uncomplicated fallbacks (which we most likely already had in our code anyway), such as the sse2 pmin/pmax or pabs functions, but it gets really cumbersome with the ones that are rather complex to emulate (with quite some potential for issues, due to the requirement to match exactly what autoupgrade would do).

I'm also concerned that the more complex patterns are easier for other optimizations to simplify a little and break. The simpler things like pmin/pmax or pabs aren't so bad to not match when they get optimized a little.

tkrupa added a comment.EditedMay 10 2018, 1:27 AM

About the test/CodeGen/X86/avx2-intrinsics-fast-isel.ll change - there is a canonical form for the subus pattern I'm using here, different from the adds/addus/subs patterns. While those three use an ext/trunc pattern and fold correctly, subus has only max+sub. If fast-isel is enabled, the sub is not put into the SelectionDAG - that's what prevents it from combining. Instead, it's appended after isel as a lowered node. Is there a machine instruction pass for combining?

> I'm also concerned that the more complex patterns are easier for other optimizations to simplify a little and break. The simpler things like pmin/pmax or pabs aren't so bad to not match when they get optimized a little.

I was wondering about that too.
Also, another concern I actually have is that I believe some of these sequences are very suboptimal. No one (at least when thinking about how to do it fast) would write them that way when emulating this manually.
The subus looks great (we're doing the same as fallback) - it's just cmp/select/sub.
The addus is simply terrible.
We're doing min((unsigned)a, b xor ~0) + b instead - that is only xor/cmp/select/add.
(Another (better) solution, and what is typically used to detect overflow for unsigned adds (apart from the select, of course), is (unsigned)(a + b) < (unsigned)a ? ~0 : a + b - only add/cmp/select.)
We're generally avoiding sext/trunc sequences (typically no good comes from them...), so for signed saturated sub/add we're using some more complex sequences (a couple more cmp/select/sub (or add)). Not entirely sure if it would actually result in better code if you really have to emulate it, but the sext/trunc can be a problem in itself (don't ask me how many shuffle instructions it would generate on sse2 only...), and of course the add/cmp/select really costs twice the instructions, due to running on twice-as-wide vectors in the end. These are really quite complex to emulate.
So maybe for the unsigned saturated add/sub, using optimal patterns wouldn't be too bad, without too much concern about not being able to recognize them again in the end. But I'm not sure about the signed ones.
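The add/cmp/select and max/sub formulations described above can be modelled with a small scalar Python sketch (hypothetical helper names; the actual patch works on vector DAG nodes, not scalars):

```python
MASK8 = 0xFF  # unsigned i8 range

def addus8(a: int, b: int) -> int:
    # Saturating unsigned add via add/cmp/select:
    # if the wrapped sum is smaller than an operand, the add overflowed.
    s = (a + b) & MASK8
    return MASK8 if s < a else s

def subus8(a: int, b: int) -> int:
    # Saturating unsigned sub via max/sub: max(a, b) - b clamps at 0.
    return max(a, b) - b

assert addus8(200, 100) == 255   # 300 saturates to 0xFF
assert addus8(10, 20) == 30
assert subus8(10, 20) == 0       # underflow clamps to 0
assert subus8(20, 5) == 15
```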

During internal review I proposed something like this: (~0 - a) < b ? ~0 : a+b.
It gets rid of the zext/trunc in the addus pattern, but introduces an additional subtraction, which is presumably more costly. Your solution seems much better - I'll change it to that form.

tkrupa updated this revision to Diff 146864.May 15 2018, 10:02 AM
tkrupa added a subscriber: mike.dvoretsky.

As decided, I removed the lowering in AutoUpgrade.cpp. I also moved the tests into the corresponding *target*-intrinsics-canonical.ll files, as discussed with @mike.dvoretsky. I added tests for PR37260 in test/CodeGen/X86/avx2-intrinsics-canonical.ll.
What's your opinion on the fast-isel subus emission and my proposal in the detectAddSubSatPattern function?

FWIW, as said, I think I could live with autoupgrade of at least the unsigned sat ones. These have similar complexity to pabs (pabs is cmp/sel/neg whereas these are add (or sub)/cmp/sel, although the add one also requires a constant), so if there were no concerns about optimizations breaking pabs recognition, there might not be any here either - if it actually works (which doesn't always seem to be the case, according to the test cases?). (Although personally I'd have liked to keep the pabs intrinsics too...)
Though disappearing intrinsics causing a null call in the generated code, rather than an error when compiling, remains a concern - but I certainly know about this one now, so it's easily debugged :-).
Other than that, there's a bug in the logic of the addus; otherwise this looks reasonable to me, though I'm not an expert on llvm internals.

lib/Target/X86/X86ISelLowering.cpp
32275–32277 ↗(On Diff #146864)

This isn't quite right.
For x < x+y this is false when y == 0, but that wouldn't be an overflow.
So it must be x <= x+y ? x+y : ~0 (or some reversed form like x > x+y ? ~0 : x+y - I suppose it doesn't make a difference in the end).
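The y == 0 corner case can be verified with a small Python model of the two predicates (a scalar sketch with hypothetical names, not the actual DAG code):

```python
MASK8 = 0xFF  # unsigned i8 range

def addus_buggy(x: int, y: int) -> int:
    s = (x + y) & MASK8
    # Wrong: when y == 0 the test x < x is false, so the result
    # is misclassified as an overflow and saturated.
    return s if x < s else MASK8

def addus_fixed(x: int, y: int) -> int:
    s = (x + y) & MASK8
    # Correct: equality just means y == 0, which is not an overflow.
    return s if x <= s else MASK8

assert addus_buggy(7, 0) == 255   # misclassified as overflow
assert addus_fixed(7, 0) == 7     # correct result
assert addus_fixed(200, 100) == 255  # real overflow still saturates
```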

test/CodeGen/X86/avx2-intrinsics-fast-isel.ll
2585–2587 ↗(On Diff #146864)

Err, so it doesn't actually recognize the pattern?

tkrupa updated this revision to Diff 147000.EditedMay 16 2018, 12:56 AM

I brought back the lowering in AutoUpgrade for the unsigned intrinsics, although I'm not sure there's universal agreement about it. In email correspondence @DavidKreitzer
suggested that it would be better to retain the intrinsics and remove them in a later patch, which Craig disagreed with, arguing that we can lower the non-complex ones (like addus/subus) without much trouble. Is there a consensus?

Fixed the logic error in the addus recognition - thanks for pointing that out. My recognition is more primitive than the existing subus one - should I expand it for those special cases or just leave a TODO?

Could you address the question at lib/Target/X86/X86ISelLowering.cpp:36183?

tkrupa added inline comments.May 16 2018, 12:57 AM
test/CodeGen/X86/avx2-intrinsics-fast-isel.ll
2585–2587 ↗(On Diff #146864)

Please look at my comment from Thu, May 10, 10:27 AM.

tkrupa edited the summary of this revision. (Show Details)May 16 2018, 1:20 AM

> Is there a machine instruction pass for combining?

I haven't been following this review, but I saw this question, and the answer is: yes.
See class MachineCombiner (pass name is shown as "Machine InstCombiner").

The addus looks right now, thanks.

test/CodeGen/X86/avx2-intrinsics-fast-isel.ll
2585–2587 ↗(On Diff #146864)

Ah sorry I missed that.
I'm not quite sure if that means the pattern can never be folded into a single instruction, but if so I'm very much against autoupgrade. The code might not be much worse, but in 99.99% of all cases it's going to be worse than using the intrinsic.

tkrupa added inline comments.May 16 2018, 9:16 AM
test/CodeGen/X86/avx2-intrinsics-fast-isel.ll
2585–2587 ↗(On Diff #146864)

It only fails to fold with fast-isel (one of the non-standard instruction selection options, chosen with a compiler flag); it is still always combined with the default SelectionDAGISel.

sroland added inline comments.May 16 2018, 9:23 AM
test/CodeGen/X86/avx2-intrinsics-fast-isel.ll
2585–2587 ↗(On Diff #146864)

Ok, we're not using that, but I believe at some point we considered it. (Compile times are really important with JIT, and that's my other problem with autoupgrade anyway - this is not going to help there, but it might not be significant, and I'm thinking that's just the price you have to pay for the compiler getting smarter.)

About lib/Target/X86/X86ISelLowering.cpp:36183 - does it make sense to emit a normal ADD/SUB without saturation when the element type after truncation is larger than the one before extension?

test/CodeGen/X86/avx2-intrinsics-fast-isel.ll
2585–2587 ↗(On Diff #146864)

Can the fast-isel case be left like that, then? If not, I could try to do the MachineCombiner folding (from what I've seen, we don't currently have any combining like that for x86) or abandon the AutoUpgrade for this particular intrinsic.

sroland added inline comments.May 17 2018, 12:04 PM
test/CodeGen/X86/avx2-intrinsics-fast-isel.ll
2585–2587 ↗(On Diff #146864)

From my perspective for mesa, yes, it should be ok.

tkrupa updated this revision to Diff 148890.May 29 2018, 5:47 AM

Added a check for two constant operands. I'm still waiting for an answer to my last question.

craig.topper added inline comments.Jul 16 2018, 3:35 PM
lib/Target/X86/X86ISelLowering.cpp
32275 ↗(On Diff #148890)

Shouldn't this be getSetCCSwappedOperands?

32290 ↗(On Diff #148890)

isEqualTo is overkill here. You can just compare Other->getOperand(0) == CondLHS

And it should be "Other." instead of "Other->" in most of this code. Technically they are equivalent, but we tend to use the . on SDValue.

32301 ↗(On Diff #148890)

Again you don't need isEqualTo.

craig.topper added inline comments.Jul 16 2018, 3:47 PM
lib/Target/X86/X86ISelLowering.cpp
36289 ↗(On Diff #148890)

Why can't you just use APInt::isIntN for unsigned and APInt::isSignedIntN for signed? Why do you need to extract the high bits yourself?
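For reference, the semantics of those APInt helpers can be sketched in Python (a model of the behavior, not the LLVM API itself; the `width` parameter stands in for the APInt's bit width):

```python
def is_int_n(value: int, n: int, width: int = 64) -> bool:
    # Model of APInt::isIntN: the value, read as an unsigned
    # width-bit integer, fits in n bits.
    value &= (1 << width) - 1
    return value < (1 << n)

def is_signed_int_n(value: int, n: int, width: int = 64) -> bool:
    # Model of APInt::isSignedIntN: the value, read as a signed
    # width-bit integer, fits in the signed n-bit range.
    value &= (1 << width) - 1
    if value >= 1 << (width - 1):
        value -= 1 << width          # interpret the top bit as the sign
    return -(1 << (n - 1)) <= value < (1 << (n - 1))

assert is_int_n(255, 8) and not is_int_n(256, 8)
assert is_signed_int_n(-128, 8) and not is_signed_int_n(128, 8)
```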

tkrupa updated this revision to Diff 155836.Jul 17 2018, 3:56 AM

Rebased and implemented suggested changes.

tkrupa updated this revision to Diff 155844.Jul 17 2018, 4:05 AM

Removed unnecessary comment.

This revision is now accepted and ready to land.Jul 17 2018, 10:03 PM
craig.topper requested changes to this revision.Jul 20 2018, 2:23 PM
craig.topper added inline comments.
lib/Target/X86/X86ISelLowering.cpp
36873 ↗(On Diff #155844)

isBuildVectorConstantSDNodes allows undef elements which will fail this cast. You need to add a skip for undef. Please add test cases.

This revision now requires changes to proceed.Jul 20 2018, 2:23 PM
sroland added inline comments.Jul 20 2018, 4:01 PM
include/llvm/IR/IntrinsicsX86.td
367 ↗(On Diff #155844)

FWIW, I don't quite agree this is really a FIXME (not without some appropriate replacement, like a generic llvm intrinsic for saturating arithmetic).
I'd take an intrinsic over tons of simple IR ops (in the vague hope they will get recognized again, unless some other optimization messed them up) any day of the week.

But otherwise looks alright to me.

tkrupa updated this revision to Diff 156996.Jul 24 2018, 4:00 AM

Now the pattern is not detected if there are any undef elements in the constant operand.
I removed the tests added after fixing the bug which caused the reversion of the patch - in the meantime the pattern conditions changed, and the sequence used in those tests is now perfectly valid.
Should the FIXME comments be removed?

If you remove the FIXMEs, you need to replace them with a comment that says not to delete them. I have at various times run a script or grep looking for unused intrinsics that we forgot to delete.

But if we aren't going to delete them, I somewhat feel like we should continue using them in clang and focus on teaching InstCombine how to constant fold them and do other useful optimizations. Having a separate way of doing things for clang and other users of LLVM will lead to inconsistent optimization.

tkrupa updated this revision to Diff 159459.Aug 7 2018, 12:37 AM

Brought back signed intrinsics and added constant folding in InstCombine.

Would it be better to split the signed/unsigned handling into separate patches, since they've diverged so much?

lib/IR/AutoUpgrade.cpp
79 ↗(On Diff #159459)

This is version 8.0 now

tkrupa updated this revision to Diff 159872.Aug 9 2018, 2:15 AM
tkrupa retitled this revision from [X86] Lowering adds/addus/subs/subus intrinsics to native IR (LLVM part) to [X86] Lowering addus/subus intrinsics to native IR (LLVM part).

Split to two different patches for signed and unsigned intrinsics.
Corrected version number.

craig.topper added inline comments.Aug 9 2018, 10:17 AM
lib/Target/X86/X86ISelLowering.cpp
32900 ↗(On Diff #159872)

Why are v16i32 and v8i64 included here? We don't have saturating add/sub for those types do we?

tkrupa updated this revision to Diff 160300.Aug 13 2018, 1:14 AM
tkrupa marked an inline comment as done.
This revision is now accepted and ready to land.Aug 13 2018, 11:49 AM

Is the clang part good to go too?

This revision was automatically updated to reflect the committed changes.
jgorbe added a subscriber: jgorbe.Aug 17 2018, 3:58 AM

Hi,

I think this change is breaking one of our builds. The attached reduced test case fails with the current trunk revision if built with "clang -x c -O2 -mavx -c crash.ii".

> I think this change is breaking one of our builds. The attached reduced test case fails with the current trunk revision if built with "clang -x c -O2 -mavx -c crash.ii".

define void @d() {
entry:

%wide.load = load <4 x i16>, <4 x i16>* undef, align 2
%0 = sext <4 x i16> %wide.load to <4 x i32>
%1 = sub nsw <4 x i32> zeroinitializer, %0
%2 = icmp sgt <4 x i32> %1, <i32 -32768, i32 -32768, i32 -32768, i32 -32768>
%3 = select <4 x i1> %2, <4 x i32> %1, <4 x i32> <i32 -32768, i32 -32768, i32 -32768, i32 -32768>
%4 = icmp slt <4 x i32> %3, <i32 32767, i32 32767, i32 32767, i32 32767>
%5 = select <4 x i1> %4, <4 x i32> %3, <4 x i32> <i32 32767, i32 32767, i32 32767, i32 32767>
%6 = trunc <4 x i32> %5 to <4 x i16>
store <4 x i16> %6, <4 x i16>* undef, align 2
unreachable

}
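The reduced IR above implements a signed saturating i16 subtract (widen, subtract, clamp to the i16 range, truncate); a scalar Python model of the same sequence, for illustration only:

```python
I16_MIN, I16_MAX = -32768, 32767  # signed i16 range

def ssubs16(a: int, b: int) -> int:
    # Signed saturating i16 subtract, as in the reduced test case:
    # compute exactly in a wider type, then clamp and truncate.
    d = a - b              # exact difference (the i32 sub in the IR)
    d = max(d, I16_MIN)    # icmp sgt / select: lower clamp
    d = min(d, I16_MAX)    # icmp slt / select: upper clamp
    return d

assert ssubs16(0, -32768) == 32767        # 32768 saturates high
assert ssubs16(-32768, 32767) == -32768   # -65535 saturates low
assert ssubs16(100, 30) == 70             # in-range values pass through
```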

I think I see the issue. Working on a fix.

It turns out this patch only tested one of the two new paths added to X86ISelLowering.cpp. I've removed detectAddSubSatPattern, which contained the bug. I think we may want to put it back, but only with appropriate tests.

srj added a subscriber: srj.Aug 23 2018, 5:53 PM
srj added a comment.EditedAug 23 2018, 6:39 PM

I'm trying to update Halide to generate IR that will be recognized by this patch (instead of calling the now-deprecated intrinsics), but I'm having trouble with a somewhat degenerate-but-legal case.

If user code specifies a non-native vector width (e.g., paddusb with 8 lanes on sse2, instead of the native width of 16 lanes), our code handles this by loading at the requested size, then combining vectors up to the native size. So our code formerly emitted something like:

# Do a saturating unsigned subtract on two <8 x i8> vectors
# (widened to the native <16 x i8>, then narrowed back to <8 x i8>)
%20 = load <8 x i8>
%21 = load <8 x i8>
%22 = shufflevector <8 x i8> %20, <8 x i8> undef, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
%23 = shufflevector <8 x i8> %21, <8 x i8> undef, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
%24 = call <16 x i8> @llvm.x86.sse2.psubus.b(<16 x i8> %22, <16 x i8> %23) #5
%25 = shufflevector <16 x i8> %24, <16 x i8> undef, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>

To work with this patch, I revised our code to emit inline IR that should pattern-match properly (based on the new self-tests for the IR), something like:

%20 = load <8 x i8>
%21 = load <8 x i8>
%22 = shufflevector <8 x i8> %20, <8 x i8> undef, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
%23 = shufflevector <8 x i8> %21, <8 x i8> undef, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
# Here's the inline pattern that should match paddusb
%24 = add <16 x i8> %22, %23
%25 = icmp ugt <16 x i8> %22, %24
%26 = select <16 x i1> %25, <16 x i8> <i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1>, <16 x i8> %24
#
%27 = shufflevector <16 x i8> %26, <16 x i8> undef, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>

And, in fact, if I don't run any optimizer passes, this works perfectly. Unfortunately, the LLVM optimizer passes can rearrange it, e.g. into a form something like this:

%20 = load <8 x i8>
%21 = load <8 x i8>
%22 = add <8 x i8> %20, %16
%23 = shufflevector <8 x i8> %22, <8 x i8> undef, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
%24 = icmp ult <8 x i8> %22, %20
%25 = shufflevector <8 x i1> %24, <8 x i1> undef, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
%26 = select <16 x i1> %25, <16 x i8> <i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 undef, i8 undef, i8 undef, i8 undef, i8 undef, i8 undef, i8 undef, i8 undef>, <16 x i8> %23
%27 = shufflevector <16 x i8> %26, <16 x i8> undef, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>

...which no longer gets recognized as a pattern that produces paddusb, since the select no longer refers directly to the result of the compare (but rather to an intermediate shuffle).

Disabling all our optimizer passes 'fixes' this but that's obviously unsuitable as a solution. Could this pattern matching be made more robust?

In D46179#1211902, @srj wrote:

> Disabling all our optimizer passes 'fixes' this but that's obviously unsuitable as a solution. Could this pattern matching be made more robust?

Would it help or hurt if we narrowed the select in IR to match the final return type and original operand types (I made the types smaller from your code just to make this easier to read):

define <2 x i8> @should_narrow_select(<2 x i8> %x, <2 x i1> %cmp) {
  %widex = shufflevector <2 x i8> %x, <2 x i8> undef, <4 x i32> <i32 0, i32 1, i32 undef, i32 undef>
  %widecmp = shufflevector <2 x i1> %cmp, <2 x i1> undef, <4 x i32> <i32 0, i32 1, i32 undef, i32 undef>
  %widesel = select <4 x i1> %widecmp, <4 x i8> <i8 -1, i8 -1, i8 undef, i8 undef>, <4 x i8> %widex
  %sel = shufflevector <4 x i8> %widesel, <4 x i8> undef, <2 x i32> <i32 0, i32 1>
  ret <2 x i8> %sel
}

-->

define <2 x i8> @should_narrow_select(<2 x i8> %x, <2 x i1> %cmp) {
  %narrowsel = select <2 x i1> %cmp, <2 x i8> <i8 -1, i8 -1>, <2 x i8> %x
  ret <2 x i8> %narrowsel
}
srj added a comment.Aug 24 2018, 10:54 AM

I took the liberty of opening a bug on this to simplify discussion: https://bugs.llvm.org/show_bug.cgi?id=38691

srj added a comment.Aug 24 2018, 11:01 AM

> Would it help or hurt if we narrowed the select in IR to match the final return type and original operand types (I made the types smaller from your code just to make this easier to read):

Not sure I understand the question -- are you suggesting that this is code I should add, or that this is a transformation that LLVM would do under the hood?

If the former, I don't see how it would help, unless I made the function not-inlined (which would seem to compromise efficiency).

If the latter, I'm not qualified to answer as I know very little about the relevant internal bits of LLVM :-)

In D46179#1212714, @srj wrote:

>> Would it help or hurt if we narrowed the select in IR to match the final return type and original operand types (I made the types smaller from your code just to make this easier to read):

> Not sure I understand the question -- are you suggesting that this is code I should add, or that this is a transformation that LLVM would do under the hood?

> If the former, I don't see how it would help, unless I made the function not-inlined (which would seem to compromise efficiency).

> If the latter, I'm not qualified to answer as I know very little about the relevant internal bits of LLVM :-)

Sorry it wasn't clear - I was suggesting the latter. The transforms that are occurring here are in instcombine, but we're missing a narrowing opportunity for the select. It does look like that will allow the backend to match saturation again (at least in my simple tests). Let's continue the discussion in the bug report since this review is closed.

@spatel, would the end result be completely removing all of the widening from v8i8 to v16i8 that was done in the incoming IR? Leaving only v8i8 types?

> @spatel, would the end result be completely removing all of the widening from v8i8 to v16i8 that was done in the incoming IR? Leaving only v8i8 types?

Yes - I'll describe/show the potential transform in the bug report.