Page MenuHomePhabricator

[DAG] Match USUBSAT patterns through zext/trunc
ClosedPublic

Authored by RKSimon on Oct 26 2016, 7:29 AM.

Details

Summary

This patch handles usubsat patterns hidden through zext/trunc and uses the getTruncatedUSUBSAT helper to determine if the USUBSAT can be correctly performed in the truncated form:

zext(x) >= y ? x - trunc(y) : 0 --> usubsat(x,trunc(umin(y,SatLimit)))
zext(x) >  y ? x - trunc(y) : 0 --> usubsat(x,trunc(umin(y,SatLimit)))

Based on original examples:

void foo(unsigned short *p, int max, int n) {
    int i;
    unsigned m;
    for (i = 0; i < n; i++) {
        m = *--p;
        *p = (unsigned short)(m >= max ? m-max : 0);
    }
}

Diff Detail

Event Timeline

yulia_koval retitled this revision from to [X86] New pattern to generate PSUBUS from SELECT.
yulia_koval updated this object.
yulia_koval added reviewers: bkramer, nadav, RKSimon, spatel.
RKSimon edited edge metadata.Nov 8 2016, 7:56 AM

Please can you regenerate the psubus.ll tests with utils/update_llc_test_checks.py

RKSimon added inline comments.Nov 8 2016, 8:05 AM
lib/Target/X86/X86ISelLowering.cpp
27379 ↗(On Diff #75873)

If you replace this with VT.getScalarSizeInBits() then clang-format might be able to do a better job

yulia_koval edited edge metadata.
  1. Changed regex-based test to generated from update_llc_test_checks.py script
  2. Replaced getScalarType().getSizeInBits() with VT.getScalarSizeInBits()

Please can you rebase to trunk latest? I've added the current codegen from your tests and your patch appears to introduce scalarization regressions for SSE targets.

yulia_koval updated this revision to Diff 83614.Jan 9 2017, 5:15 AM

This optimization requires the umin instruction. It didn't exist before SSE4.1, so I added requirement for SSE4 for it and added new target to test. I haven't found profitable equivalent for umin in sse\sse2 for this situation.

RKSimon added inline comments.Jan 9 2017, 5:31 AM
lib/Target/X86/X86ISelLowering.cpp
28958 ↗(On Diff #83614)

SSE2 has some min/max support - replace the hasSSE41 requirement with TLI.isOperationLegalOrCustom() ?

I updated the patch to trunk, because test output changed due to rL292479 optimization. I failed to find a better check, based on TLI.isLegal.., because it doesn't requires exact type. For example, in case of SSE4.1, isOperationLegalOrCustom(MIN, v8i32) returns false and disables the optimization. However, the transformation works fine as the illegal operation is split into two v4i32 operations. The hasSSE41 check is sufficient, because it includes both v8i16 and v4i32 for each integer type. i8(SSE hass i8 based umin) can't be used in this optimisation, because the ExtType is always larger then psubus type and can be minimum i16.

RKSimon added inline comments.Jan 26 2017, 5:27 AM
lib/Target/X86/X86ISelLowering.cpp
29378 ↗(On Diff #85883)

clang-format

Sorry, fixed clang-format.

spatel edited edge metadata.Feb 10 2017, 11:59 AM

It's difficult for me to see what's going on in these tests because there are a bunch of irrelevant load/store/gep/bitcast instructions. Can you remove those?

If you do that, the root of the problem is very similar to:
https://reviews.llvm.org/rL286776

Ie, if we recognized the umin/umax patterns in IR as ISD::UMIN/ISD::UMAX when we created the DAG, the subsequent matching to the x86-specific PSUBUS might already happen with the existing lowering?

It's difficult for me to see what's going on in these tests because there are a bunch of irrelevant load/store/gep/bitcast instructions. Can you remove those?

If you do that, the root of the problem is very similar to:
https://reviews.llvm.org/rL286776

Ie, if we recognized the umin/umax patterns in IR as ISD::UMIN/ISD::UMAX when we created the DAG, the subsequent matching to the x86-specific PSUBUS might already happen with the existing lowering?

I fixed tests(https://reviews.llvm.org/D32643). New revision is based on this updated tests revision.

We have (i16) (a>=b ? a - b : 0), where both a and b are 32 bit. We can't apply max here, because unsigned max between something and zero doesn't make sense. And we can't use signed max, because we can't fit 32bit unsigned values in it.

On the other hand, the result we select is 16bit PSUBUS, which only exists for 16bit and 8bit vector elements, not for 32bit. This optimization truncates vectors to 16bit, using the information, that it was zero extended and this vector is on the left side of the cmp. We use that information to substitute b with min(b, max_16bit_type). and then we can truncate a(it can be truncated, because it is zero extended value) and b(because we truncated it properly with our min). In this case the result of the sub will be valid when a >=b and not significant if not(the actual result is zero). This pattern can be then recognized as PSUBUS instruction.

This transformation is profitable only if PSUBUS exists. And this function is the place. where PSUBUS is selected and we can be 100% sure, that it will be profitable. This transformation is not profitable in general.

So in pseudo code it will be:
a >= b ? a - b : 0
transformed to:
trunc(a) >= min(b, 0xFFFF) ? trunc(a) - min(b, 0xFFFF) : 0 ->the result of the patch
This result is then transformed to PSUBUS by existing code.

I think we need to take a step back here. The tests are assuming a specific form of IR, but I'm not sure it's canonical. If we change this upstream (in InstCombine), the transform may fail to activate when you hoped it would. It's also possible that generic DAG combines could interfere with the pattern you're expecting. See rL301781 as an example. In either case, this seems like a fragile solution because the pattern can change above this.

http://rise4fun.com/Alive/5e4
Unless I made a mistake in transcribing that, the 1st form is what you're expecting in these tests. The 2nd form might be considered better since it includes a canonical 'max'.

Can you test this patch with more examples like this to see if it behaves as expected:

define i16 @scalar(i16 %x, i32 %y) nounwind {
  %lhs = zext i16 %x to i32
  %cond = icmp ugt i32 %lhs, %y
  %sub = sub i32 %lhs, %y
  %truncsub = trunc i32 %sub to i16
  %res = select i1 %cond, i16 %truncsub, i16 zeroinitializer
  ret i16 %res
}

define i16 @scalar_max(i16 %x, i32 %y) nounwind {
  %lhs = zext i16 %x to i32
  %cond = icmp ugt i32 %lhs, %y
  %umax = select i1 %cond, i32 %lhs, i32 %y
  %sub = sub i32 %umax, %y
  %truncsub = trunc i32 %sub to i16
  ret i16 %truncsub
}

define <8 x i16> @test15(<8 x i16> %x, <8 x i32> %y) nounwind {
  %lhs = zext <8 x i16> %x to <8 x i32>
  %cond = icmp ugt <8 x i32> %lhs, %y
  %sub = sub <8 x i32> %lhs, %y
  %truncsub = trunc <8 x i32> %sub to <8 x i16>
  %res = select <8 x i1> %cond, <8 x i16> %truncsub, <8 x i16> zeroinitializer
  ret <8 x i16> %res
}

define <8 x i16> @test15_max(<8 x i16> %x, <8 x i32> %y) nounwind {
  %lhs = zext <8 x i16> %x to <8 x i32>
  %cond = icmp ugt <8 x i32> %lhs, %y
  %umax = select <8 x i1> %cond, <8 x i32> %lhs, <8 x i32> %y
  %sub = sub <8 x i32> %umax, %y
  %truncsub = trunc <8 x i32> %sub to <8 x i16>
  ret <8 x i16> %truncsub
}

define <8 x i8> @test15_narrow(<8 x i8> %x, <8 x i16> %y) nounwind {
  %lhs = zext <8 x i8> %x to <8 x i16>
  %cond = icmp ugt <8 x i16> %lhs, %y
  %sub = sub <8 x i16> %lhs, %y
  %truncsub = trunc <8 x i16> %sub to <8 x i8>
  %res = select <8 x i1> %cond, <8 x i8> %truncsub, <8 x i8> zeroinitializer
  ret <8 x i8> %res
}

define <8 x i8> @test15_narrow_max(<8 x i8> %x, <8 x i16> %y) nounwind {
  %lhs = zext <8 x i8> %x to <8 x i16>
  %cond = icmp ugt <8 x i16> %lhs, %y
  %umax = select <8 x i1> %cond, <8 x i16> %lhs, <8 x i16> %y
  %sub = sub <8 x i16> %umax, %y
  %truncsub = trunc <8 x i16> %sub to <8 x i8>
  ret <8 x i8> %truncsub
}

I think we need to take a step back here. The tests are assuming a specific form of IR, but I'm not sure it's canonical. If we change this upstream (in InstCombine), the transform may fail to activate when you hoped it would. It's also possible that generic DAG combines could interfere with the pattern you're expecting. See rL301781 as an example. In either case, this seems like a fragile solution because the pattern can change above this.

http://rise4fun.com/Alive/5e4
Unless I made a mistake in transcribing that, the 1st form is what you're expecting in these tests. The 2nd form might be considered better since it includes a canonical 'max'.

Can you test this patch with more examples like this to see if it behaves as expected:

define i16 @scalar(i16 %x, i32 %y) nounwind {
  %lhs = zext i16 %x to i32
  %cond = icmp ugt i32 %lhs, %y
  %sub = sub i32 %lhs, %y
  %truncsub = trunc i32 %sub to i16
  %res = select i1 %cond, i16 %truncsub, i16 zeroinitializer
  ret i16 %res
}

define i16 @scalar_max(i16 %x, i32 %y) nounwind {
  %lhs = zext i16 %x to i32
  %cond = icmp ugt i32 %lhs, %y
  %umax = select i1 %cond, i32 %lhs, i32 %y
  %sub = sub i32 %umax, %y
  %truncsub = trunc i32 %sub to i16
  ret i16 %truncsub
}

define <8 x i16> @test15(<8 x i16> %x, <8 x i32> %y) nounwind {
  %lhs = zext <8 x i16> %x to <8 x i32>
  %cond = icmp ugt <8 x i32> %lhs, %y
  %sub = sub <8 x i32> %lhs, %y
  %truncsub = trunc <8 x i32> %sub to <8 x i16>
  %res = select <8 x i1> %cond, <8 x i16> %truncsub, <8 x i16> zeroinitializer
  ret <8 x i16> %res
}

define <8 x i16> @test15_max(<8 x i16> %x, <8 x i32> %y) nounwind {
  %lhs = zext <8 x i16> %x to <8 x i32>
  %cond = icmp ugt <8 x i32> %lhs, %y
  %umax = select <8 x i1> %cond, <8 x i32> %lhs, <8 x i32> %y
  %sub = sub <8 x i32> %umax, %y
  %truncsub = trunc <8 x i32> %sub to <8 x i16>
  ret <8 x i16> %truncsub
}

define <8 x i8> @test15_narrow(<8 x i8> %x, <8 x i16> %y) nounwind {
  %lhs = zext <8 x i8> %x to <8 x i16>
  %cond = icmp ugt <8 x i16> %lhs, %y
  %sub = sub <8 x i16> %lhs, %y
  %truncsub = trunc <8 x i16> %sub to <8 x i8>
  %res = select <8 x i1> %cond, <8 x i8> %truncsub, <8 x i8> zeroinitializer
  ret <8 x i8> %res
}

define <8 x i8> @test15_narrow_max(<8 x i8> %x, <8 x i16> %y) nounwind {
  %lhs = zext <8 x i8> %x to <8 x i16>
  %cond = icmp ugt <8 x i16> %lhs, %y
  %umax = select <8 x i1> %cond, <8 x i16> %lhs, <8 x i16> %y
  %sub = sub <8 x i16> %umax, %y
  %truncsub = trunc <8 x i16> %sub to <8 x i8>
  ret <8 x i8> %truncsub
}

Well, first 2 tests don't work with my patch because PSUBUS is a vector instruction. Test15 is working perfectly. Test15_max doesn't work, because I didn't try to recognize this pattern(the one you showed in a link), after optimization, because currently(for at least a year) llvm generates the code like in my tests, for example for psubus.c:

void foo(unsigned short *p, int max, int n)
{
int i;
unsigned m;
for (i = 0; i < n; i++) {

m = *--p;
*p = (unsigned short)(m >= max ? m-max : 0);

}
}

IR will be:

%10 = zext <8 x i16> %reverse to <8 x i32>
%11 = icmp ugt <8 x i32> %broadcast.splat16, %10
%12 = sub <8 x i32> %10, %broadcast.splat16
%13 = trunc <8 x i32> %12 to <8 x i16>
%14 = select <8 x i1> %11, <8 x i16> zeroinitializer, <8 x i16> %13

I see, that optimization you provided in http://rise4fun.com/Alive/5e4 is valid, but it extends from i16 to i32 which may trigger some regressions, because it switches to wider operations. Maybe something like this will appear in the future, but currently the pattern above is generated. Also, whether the pattern generated from this particular C code will be changed or not, this patch will still be useful for this IR pattern, if it appears. I can add recognition of the test15_max pattern as well, should I do it?

Test15_narrow converted to psubus before my patch and test15_narrow_max converted to max before my patch. The only test, changed by my patch is @test15.

Looks like rL301781 doesn't affect my patch at any tests..
Also if some optimization replace select(sub) with sub(select), won't it ruin any other psubus pattern as well?

spatel added a comment.May 2 2017, 9:31 AM

Well, first 2 tests don't work with my patch because PSUBUS is a vector instruction. Test15 is working perfectly. Test15_max doesn't work, because I didn't try to recognize this pattern(the one you showed in a link), after optimization, because currently(for at least a year) llvm generates the code like in my tests, for example for psubus.c:

It is irrelevant how long the current IR has been produced. It is also irrelevant what form the current IR takes until we agree that it is the canonical form. Let me try to explain.

void foo(unsigned short *p, int max, int n)
{
int i;
unsigned m;
for (i = 0; i < n; i++) {

m = *--p;
*p = (unsigned short)(m >= max ? m-max : 0);

}
}

IR will be:

%10 = zext <8 x i16> %reverse to <8 x i32>
%11 = icmp ugt <8 x i32> %broadcast.splat16, %10
%12 = sub <8 x i32> %10, %broadcast.splat16
%13 = trunc <8 x i32> %12 to <8 x i16>
%14 = select <8 x i1> %11, <8 x i16> zeroinitializer, <8 x i16> %13

This looks good. I'm glad we can vectorize this. :)

But do you agree that it is reasonable to write this function in C as:

void goo(unsigned short *p, int max, int n) {
  int i;
  unsigned m;
  for (i = 0; i < n; i++) {
    m = *--p;
    unsigned umax = m > max ? m : max;
    *p = (unsigned short)(umax - max);
  }
}

If yes, then I think you would also agree that LLVM should optimize that code in exactly the same way (to eventually produce a psubus when compiled for x86 -msse4.1). Now, if you agree with that, then the question is where should we recognize that these 2 functions are logically equivalent? I'm saying that it should occur in IR because that's the main job of IR: to translate source code into a canonical form that the backend can then optimize to the best machine-code for a given target.

I see, that optimization you provided in http://rise4fun.com/Alive/5e4 is valid, but it extends from i16 to i32 which may trigger some regressions, because it switches to wider operations. Maybe something like this will appear in the future, but currently the pattern above is generated. Also, whether the pattern generated from this particular C code will be changed or not, this patch will still be useful for this IR pattern, if it appears. I can add recognition of the test15_max pattern as well, should I do it?

No. Until we decide what the canonical IR for this pattern looks like, you may be adding a useless transform (dead code) to the backend. "The future" where we no longer produce the IR pattern that this patch is looking for could be today, so let's not prematurely add a backend optimization until we know that pattern will actually exist. You're suggesting that we match multiple patterns in the backend to produce psubus. I'm suggesting that if we can canonicalize this pattern in IR, then we will more efficiently optimize it in the backend regardless of how it was written in source. Does this make sense?

Also if some optimization replace select(sub) with sub(select), won't it ruin any other psubus pattern as well?

It is certainly possible. We should investigate the current patterns that match to psubus. As shown in your examples in this patch and my follow-up test15_max, the matching of psubus is not complete. Will we break the existing matching by canonicalizing the IR? If yes, we should add those transforms first to avoid regressions.

So the steps to fix this properly:

  1. Decide what is the canonical IR.
  2. Add codegen tests for that IR (this should likely include more than x86).
  3. Optimize those patterns in the backend (Do other targets have instructions equivalent to psubus? If yes, a generic DAGCombiner transform should be added.)
  4. Add the canonicalization to IR.

This should ensure that no matter how the source is written, we'll get optimal codegen for all targets, and we'll do that efficiently.

But do you agree that it is reasonable to write this function in C as:

void goo(unsigned short *p, int max, int n) {
  int i;
  unsigned m;
  for (i = 0; i < n; i++) {
    m = *--p;
    unsigned umax = m > max ? m : max;
    *p = (unsigned short)(umax - max);
  }
}

This function's IR is

%10 = zext <8 x i16> %reverse to <8 x i32>
%11 = icmp ult <8 x i32> %broadcast.splat18, %10
%12 = select <8 x i1> %11, <8 x i32> %10, <8 x i32> %broadcast.splat18
%13 = sub <8 x i32> %12, %broadcast.splat18
%14 = trunc <8 x i32> %13 to <8 x i16>

I agree, that it looks better as canonical, then my example, when trunc is in between the pattern's instructions. What do you think about defining trunc(sub(select())) canonical for this pattern? Then we can canonicalize my initial pattern using the InstCombine patch I attached(https://reviews.llvm.org/D33118) - it looks like this way the .ll code generated for both of them would be the same. After this transformation, the initial pattern can be transformed into 32bit sub(max). I haven't yet investigated how to transform it into psubus, but at least max is more profitable then currently generated instructions.

What do you think about this idea?

I agree, that it looks better as canonical, then my example, when trunc is in between the pattern's instructions. What do you think about defining trunc(sub(select())) canonical for this pattern? Then we can canonicalize my initial pattern using the InstCombine patch I attached(https://reviews.llvm.org/D33118) - it looks like this way the .ll code generated for both of them would be the same. After this transformation, the initial pattern can be transformed into 32bit sub(max). I haven't yet investigated how to transform it into psubus, but at least max is more profitable then currently generated instructions.

What do you think about this idea?

I think it's a good idea to form canonical min/max in IR when possible. There is precedent for a canonicalization that is similar to this one:
http://lists.llvm.org/pipermail/llvm-dev/2016-November/106868.html

But please send a note to llvm-dev requesting feedback for this canonicalization. We want to make sure there are no objections before proceeding.

I think the example can be minimized to this:
http://rise4fun.com/Alive/qBn

(ie, the leading zext is not needed, and this pattern is not limited to vectors)

zvi added a subscriber: delena.Jun 29 2017, 6:04 AM
zvi added a subscriber: zvi.
RKSimon edited edge metadata.Oct 26 2017, 2:31 PM

@yulia_koval What is the state of this after D37510 and D37534?

yulia_koval added a comment.EditedOct 27 2017, 9:56 PM

@yulia_koval What is the state of this after D37510 and D37534?

The one additional patch is required to apply the changes - cannonicalization part. I will submit it soon.

Did D41480 affect anything with these tests? We still need to handle 'umin' patterns in IR and update the tests here with canonical IR forms to show psubus codegen?

Did D41480 affect anything with these tests? We still need to handle 'umin' patterns in IR and update the tests here with canonical IR forms to show psubus codegen?

Looks like all possible umin patterns are already handled in existing patterns we added for max. The last tests in psubus.ll is also cannonical cmp-select-sub sequence. What should I add which is missing?

We've added custom lowering support for all (US)MIN/MAX vector types, plus we have improved saturation packing so you might be able to relax the SSE41 requirement - please can you check the codegen on SSE2/SSSE3 if you allow the combines?

Did D41480 affect anything with these tests? We still need to handle 'umin' patterns in IR and update the tests here with canonical IR forms to show psubus codegen?

Looks like all possible umin patterns are already handled in existing patterns we added for max. The last tests in psubus.ll is also cannonical cmp-select-sub sequence. What should I add which is missing?

What does this comment in InstCombineSelect refer to?
/// TODO: Also support a - UMIN(a,b) patterns.

Also, there's a proposal to remove the intrinsics now - D44785. How does that relate to whatever is left to do here?

Also, there's a proposal to remove the intrinsics now - D44785. How does that relate to whatever is left to do here?

It turns out there is a canonical represention of subus in place. D44785 patch does not take it into account - I'm going to modify it so that there's no conflict.

@yulia_koval Are you still working on this?

@yulia_koval Are you still working on this?

It seems to continue in D33118. Mentioning it so Phabricator makes the link.

kristina removed a subscriber: kristina.Sep 8 2018, 8:59 AM

What does this comment in InstCombineSelect refer to?

/// TODO: Also support a - UMIN(a,b) patterns.

This was a todo to investigate possibilities for generating min patterns, but seems like all of them are already handled, so we can just remove this TODO.

Also, there's a proposal to remove the intrinsics now - D44785. How does that relate to whatever is left to do here?

This is already commited, I think. This was converting intrinsics to cannonical form.

It seems to continue in D33118. Mentioning it so Phabricator makes the link.

This is old version of same revision, abandoned it.

@yulia_koval Are you still working on this?

No, I think this issue is finished..

RKSimon commandeered this revision.Jan 12 2021, 2:30 PM
RKSimon edited reviewers, added: yulia_koval; removed: RKSimon.

I've been working on PR40111 to make most x86 ISD::USUBSAT code generic and moved into DAGCombine - I'll look at this as well.

RKSimon planned changes to this revision.Jan 17 2021, 7:16 AM
RKSimon updated this revision to Diff 324279.Feb 17 2021, 5:31 AM
RKSimon retitled this revision from [X86] New pattern to generate PSUBUS from SELECT to [DAG] Match USUBSAT patterns through zext/trunc.
RKSimon edited the summary of this revision. (Show Details)

Converted patch to a generic DAGCombine fold

Herald added a project: Restricted Project. · View Herald TranscriptFeb 17 2021, 5:31 AM
RKSimon added inline comments.Feb 17 2021, 5:33 AM
llvm/test/CodeGen/X86/psubus.ll
563

Still investigating these cases (test14 + test16) - annoyingly alive2 is broken which is slowing me down :-( - but it should be just a similar pattern to this patch.

craig.topper added inline comments.Feb 19 2021, 5:45 PM
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
9875

x and y don't have the same bitwidth so the result of the fold isn't usubsat x, y. Is it usubsat x, trunc(umin(y, SatLimit))?

RKSimon updated this revision to Diff 325193.Feb 20 2021, 5:25 AM
RKSimon edited the summary of this revision. (Show Details)

Fix comment

This revision is now accepted and ready to land.Feb 20 2021, 10:30 AM
This revision was automatically updated to reflect the committed changes.