This is an archive of the discontinued LLVM Phabricator instance.

[DAGCombiner] use UADDO to optimize saturated unsigned add
ClosedPublic

Authored by spatel on Sep 11 2018, 7:02 AM.

Details

Summary

This is a preliminary step towards solving PR14613:
https://bugs.llvm.org/show_bug.cgi?id=14613

If we have an 'add' instruction that sets flags, we can use it to eliminate an explicit compare instruction or some other flag-setting instruction (such as cmn) whose flags feed the later select.

As shown in the unchanged tests that use 'icmp ugt %x, %a', we're effectively reversing an IR icmp canonicalization that replaces a variable operand with a constant:
https://rise4fun.com/Alive/V1Q
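
For concreteness, here are the two shapes of the saturated-add pattern from the tests, rendered as plain C++ (a hedged illustration of the test IR, not code from this patch; the constant 42 is arbitrary):

  // "using_cmp_sum": compare x against the wrapped sum.
  unsigned using_cmp_sum(unsigned x) {
    unsigned a = x + 42;
    return x > a ? ~0u : a;     // 'icmp ugt %x, %a' -> saturate to UINT_MAX
  }

  // "using_cmp_notval": compare x against the inverted constant.
  unsigned using_cmp_notval(unsigned x) {
    unsigned a = x + 42;
    return x > ~42u ? ~0u : a;  // 'icmp ugt %x, -43' -> saturate to UINT_MAX
  }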

But we're not forming 'uaddo' in those cases via DAG transforms; that happens in CGP (after D8889) without checking target lowering to see if the op is supported. That's why AArch64 only shows diffs for i32/i64. The existing codegen for the CGP-altered i8/i16 tests suggests that AArch64 could do better by marking the other types as 'custom', but that's outside the scope of this patch.

Diff Detail

Event Timeline

spatel created this revision.Sep 11 2018, 7:02 AM

That's why AArch64 only shows diffs for i32/i64.

Reading that again, that's likely to confuse you. :) More detailed explanation:

AArch64 shows 'uaddo' codegen for the i8/i16/i32/i64 test variants with "using_cmp_sum" in the title. That's the pattern that CGP matches as an unsigned saturated add and converts to uaddo without checking target capabilities.

This patch is gated by isOperationLegalOrCustom(ISD::UADDO, VT), so we only see AArch64 diffs for i32/i64 in the tests with "using_cmp_notval" in the title (unlike x86, which sees improvements for all sizes because all sizes are 'custom'). But the AArch64 code (like x86) looks better when translated to 'uaddo' in all cases. So someone who works on AArch64 may want to mark i8/i16 as 'custom' for UADDO, so this patch will fire on those tests.
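
A minimal sketch of that gating, assuming the usual DAGCombiner/TargetLowering helpers (paraphrased, not copied from the patch):

  // Bail out unless the target has declared UADDO legal or custom for this
  // type; forming a node the target can't lower well would likely pessimize
  // the final code.
  EVT VT = N0.getValueType();
  if (!TLI.isOperationLegalOrCustom(ISD::UADDO, VT))
    return SDValue();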

Another possibility given the existing behavior: we could remove the legal-or-custom check altogether, on the assumption that a UADDO sequence is canonical/optimal before we ever reach here. But that seems like a bug to me: if the target doesn't have an add-with-flags op, then we're not likely to get optimal DAG combining out of a UADDO node. Similar reasoning explains why we don't canonicalize IR to the sibling overflow intrinsic (llvm.uadd.with.overflow) in the first place.

lebedev.ri added inline comments.Sep 12 2018, 9:13 AM
lib/CodeGen/SelectionDAG/DAGCombiner.cpp
7312–7323

Seems like some of this can be a preparatory NFC commit?

7332–7333

I feel like this needs a comment with an IR example...

spatel updated this revision to Diff 165176.Sep 12 2018, 4:29 PM
spatel marked 2 inline comments as done.

Patch updated:

  1. Removed the NFC formatting changes (landed separately as rL342095).
  2. Added an IR example to the code comment. This transform has a lot of predicates, and there's no direct twin in IR, so I'm not sure this actually makes it better...but let me know. :)

Do we somehow enforce that, in %r = select i1 %c, i32 -1, i32 %a, the -1 is in the middle (the true operand)?
If not, I think we miss at least one case:

$ ~/src/alive/alive.py /tmp/test.txt 
----------------------------------------
Optimization: 1
Precondition: true
  %a = add i32 %x, 42
  %c = icmp ugt i32 %x, -43
  %r = select i1 %c, i32 -1, i32 %a
=>
  %c2 = icmp ult i32 %x, -43
  %r = select i1 %c2, i32 %a, i32 -1

Done: 1
Optimization is correct!

I thought there was another case with an inverted/negated 42 / -43, but I can't find it right now.

spatel added a subscriber: nlopes.Sep 21 2018, 10:09 AM

Do we somehow enforce that, in %r = select i1 %c, i32 -1, i32 %a, the -1 is in the middle (the true operand)?
If not, I think we miss at least one case:

That's correct. I think there are also 4 swapped variants for the pattern with a variable and a 'not' op (online version of Alive looks dead? cc'ing @nlopes):

Optimization: swap1
Precondition: true
  %noty = xor i32 %y, -1
  %a = add i32 %x, %y
  %c = icmp ugt i32 %x, %noty
  %r = select i1 %c, i32 -1, i32 %a
=>
  %c2 = icmp ugt i32 %noty, %x
  %r = select i1 %c2, i32 %a, i32 -1

Done: 1
Optimization is correct!

----------------------------------------
Optimization: swap2
Precondition: true
  %noty = xor i32 %y, -1
  %a = add i32 %x, %y
  %c = icmp ugt i32 %x, %noty
  %r = select i1 %c, i32 -1, i32 %a
=>
  %c2 = icmp ult i32 %x, %noty
  %r = select i1 %c2, i32 %a, i32 -1

Done: 1
Optimization is correct!

----------------------------------------
Optimization: swap3
Precondition: true
  %noty = xor i32 %y, -1
  %a = add i32 %x, %y
  %c = icmp ugt i32 %x, %noty
  %r = select i1 %c, i32 -1, i32 %a
=>
  %c2 = icmp ult i32 %noty, %x
  %r = select i1 %c2, i32 -1, i32 %a

Done: 1
Optimization is correct!

My plan is to canonicalize all of the patterns in IR. If I'm seeing it correctly, there shouldn't be anything blocking those canonicalizations because we only try to form the uaddo here when the cmp (setcc) has one use (the select). That should let us get by with just the basic matching here in the backend. For example, the pattern with a variable should never reach the backend with a 'not' op because we can always shrink it in IR:

%noty = xor i8 %y, -1
%a = add i8 %x, %y
%c = icmp ugt i8 %x, %noty
%r = select i1 %c, i8 -1, i8 %a

=>

%a = add i8 %x, %y
%c = icmp ugt i8 %x, %a
%r = select i1 %c, i8 -1, i8 %a
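
To show where this is headed, here is the saturating add with the overflow bit made explicit, using __builtin_add_overflow as a C++ stand-in for the 'uaddo' node (a sketch, not output of this patch):

  unsigned sat_add(unsigned x, unsigned y) {
    unsigned sum;
    // Like ISD::UADDO: one operation yields both the sum and the carry bit.
    bool wrapped = __builtin_add_overflow(x, y, &sum);
    return wrapped ? ~0u : sum; // saturate to UINT_MAX on overflow
  }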
lebedev.ri accepted this revision.Sep 21 2018, 10:41 AM

Do we somehow enforce that, in %r = select i1 %c, i32 -1, i32 %a, the -1 is in the middle (the true operand)?
If not, I think we miss at least one case:

That's correct. I think there are also 4 swapped variants for the pattern with a variable and a 'not' op (online version of Alive looks dead? cc'ing @nlopes):
...

My plan is to canonicalize all of the patterns in IR. If I'm seeing it correctly, there shouldn't be anything blocking those canonicalizations because we only try to form the uaddo here when the cmp (setcc) has one use (the select). That should let us get by with just the basic matching here in the backend.

Ok, sounds good. At worst, this can be extended.
I'm not sure why there is no x86 test coverage.
Won't [[ https://www.felixcloutier.com/x86/ADC.html | ADC ]] work for this purpose?
[[ https://www.felixcloutier.com/x86/ADD.html#description | ADD evaluates the result for both signed and unsigned integer operands and sets the CF and OF flags to indicate a carry (overflow) in the signed or unsigned result, respectively ]], too.
Some plumbing missing?

I think this looks good, but maybe wait a bit just in case someone else wants to comment.

This revision is now accepted and ready to land.Sep 21 2018, 10:41 AM

Won't ADC work for this purpose? ... Some plumbing missing?

Thanks! I'm not following the 'adc' suggestion though. That uses the carry from a previous op in the addition. What would that code sequence look like?

I'm not following the 'adc' suggestion though. That uses the carry from a previous op in the addition. What would that code sequence look like?

Uhm, after thinking a bit more, I think we'd only care about ADD? I haven't really thought about it, though.
But in the ADC case, I guess it would be the same as with ADD, but prefixed with [[ https://www.felixcloutier.com/x86/CLC.html | CLC ]], so there's no point in using ADC in the first place.

For completeness' sake, I was *thinking* of something like this: https://godbolt.org/z/uoShW6
But it is totally possible that it makes no sense :)

For completeness' sake, I was *thinking* of something like this: https://godbolt.org/z/uoShW6

Yes - that's the kind of codegen we're trying for - see the x86 test for unsigned_sat_constant_i32_using_cmp_notval().
But I think you have overflow and carry mixed up:
https://stackoverflow.com/questions/19301498/carry-flag-auxiliary-flag-and-overflow-flag-in-assembly
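
For reference, the distinction computed in plain C++ (a hedged illustration; both helpers assume 32-bit operands):

  #include <cstdint>
  bool carry_out(uint32_t x, uint32_t y) {      // CF: unsigned wraparound
    return x + y < x;
  }
  bool signed_overflow(int32_t x, int32_t y) {  // OF: signed wraparound
    int32_t sum;
    return __builtin_add_overflow(x, y, &sum);
  }
  // carry_out(0xFFFFFFFFu, 1) == true,  signed_overflow(-1, 1) == false
  // carry_out(0x7FFFFFFFu, 1) == false, signed_overflow(INT32_MAX, 1) == true

x86's ADD sets both flags independently; the saturated-add pattern here cares about the carry flag.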

Ignoring all of the missing canonicalizations for the moment, this patch will give us the same output as gcc:
https://godbolt.org/z/dYnXhf

(cmovnc is the same as cmovae)
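
Concretely, the aim is for the notval pattern to compile to an add-plus-cmov pair with no separate compare; the expected instructions are sketched below (paraphrased from the linked gcc output; registers are illustrative):

  unsigned sat_add42(unsigned x) {
    unsigned a = x + 42;
    return x > ~42u ? ~0u : a;
    // Desired x86-64, roughly:
    //   addl    $42, %edi      ; sets CF on unsigned overflow
    //   movl    $-1, %eax
    //   cmovael %edi, %eax     ; keep the sum if no carry (cmovae == cmovnc)
  }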

This revision was automatically updated to reflect the committed changes.