Page MenuHomePhabricator

[InstCombine] Signed saturation patterns
ClosedPublic

Authored by dmgreen on Oct 8 2019, 9:00 AM.

Details

Summary

This adds a instcombine match for code that attempts to perform signed saturating arithmetic by casting to a higher type. For example:
https://godbolt.org/z/9knBnP
As can be seen the unsigned cases are already matched, but signed are not. Adding them for these cases involves matching the min(max(add a b)) nodes, with proper truncs and extends.

There is some work in D68643 to make the default lowering for these better. With that I believe that the intrinsic is a better choice than extending into a larger type (although it's obviously hard to tell for all architectures). This also helps with vectorization, especially in dsp routines that often want to use saturating arithmetic.

This also adds a m_SpecificAPInt matcher which seemed to be useful, similar to m_SpecificInt but taking an APInt for cases like this where using an APInt make sense.

  %a_wide = sext i4 %a to i8
  %b_wide = sext i4 %b to i8
  %add = add i8 %b_wide, %a_wide
  %should_not_clamp_high = icmp slt i8 %add, 7
  %clamped_high = select i1 %should_not_clamp_high, i8 %add, i8 7
  %should_not_clamp_low = icmp sgt i8 %clamped_high, -8
  %clamped_low = select i1 %should_not_clamp_low, i8 %clamped_high, i8 -8
  %res = trunc i8 %clamped_low to i4
  ret i4 %res
=>
  %res = sadd_sat i4 %a, %b
  ret i4 %res
  %b_wide = sext i4 %b to i8
  %a_wide = sext i4 %a to i8
  %add = add i8 %b_wide, %a_wide
  %should_not_clamp_high = icmp slt i8 %add, 7
  %clamped_high = select i1 %should_not_clamp_high, i8 %add, i8 7
  %should_not_clamp_low = icmp sgt i8 %clamped_high, -8
  %clamped_low = select i1 %should_not_clamp_low, i8 %clamped_high, i8 -8

Done: 1
Optimization is correct!

Diff Detail

Event Timeline

dmgreen created this revision.Oct 8 2019, 9:00 AM
Herald added a project: Restricted Project. · View Herald TranscriptOct 8 2019, 9:00 AM
Herald added a subscriber: hiraditya. · View Herald Transcript

Do we already form these saturating intrinsics in instcombine, if they are in their native format?
Can that code be extended to also use this knowledge about truncation?

Hello.

Can you explain what you mean by "native format"? Do you mean without the extends/truncs, as a different way of specifying them? (I think the problem at least from C is dealing with overflowing arithmetic being undefined. If you extend at least one bit then the arithmetic can't overflow, so you can do the min/max like it's done here).

I don't think there is anywhere in instcombine that currently forms a sadd_sat or ssub_sat (as opposed to uadd_sat or usub_sat), unless it's from an existing sadd_sat. We do form uadd_sat as in rL357012 and usub_sat from selects.

I really just need some way to generate sadd_sats for vectorisation. If there's a better way than this, I'm all ears :)

Hello.

Can you explain what you mean by "native format"? Do you mean without the extends/truncs, as a different way of specifying them?

Err, i meant native bitwidth, so yes, without any extends/truncs.

(I think the problem at least from C is dealing with overflowing arithmetic being undefined. If you extend at least one bit then the arithmetic can't overflow, so you can do the min/max like it's done here).

True.
This highlights that we also might want to form it from .with.overflow() intrinsics.

I don't think there is anywhere in instcombine that currently forms a sadd_sat or ssub_sat (as opposed to uadd_sat or usub_sat), unless it's from an existing sadd_sat. We do form uadd_sat as in rL357012 and usub_sat from selects.

I really just need some way to generate sadd_sats for vectorisation. If there's a better way than this, I'm all ears :)

hsaito added a subscriber: hsaito.Oct 8 2019, 2:26 PM

We do form uadd_sat as in rL357012 and usub_sat from selects.

I really just need some way to generate sadd_sats for vectorisation. If there's a better way than this, I'm all ears :)

If we just need vectorization with saturating add to happen, we just need a better pattern matcher utility. There should be benefits for going to intrinsic that compensates for the possible loss of optimizations that we might get with a sequence of Instructions. Is this (i.e., saturating add/sub intrinsic as the canonical form) something already discussed in llvm-dev that I missed? When I was involved in the "vector idioms" discussion few years ago, general consensus (among 10+ people interested in that topic) was that we need better pattern matchers than pushing for vector idiom specific intrinsics. That's why I'm asking.

The same question also applies to rL357012.

nikic added a comment.Oct 8 2019, 2:49 PM

@hsaito We've put quite a bit of effort into making sure that saturating intrinsics optimize as well or better than expanded IR sequences, which is why they are indeed considered canonical IR. Whether intrinsics are canonical needs to be decided on a case by case basis, there is no general rule about it. For example the unsigned mul overflow intrinsic is considered canonical (for obvious reasons -- the alternatives tend to be much more expensive) and is formed in InstCombine. Unsigned add/sub overflow on the other hand are not, because they tends to optimize much worse than expanded IR. Those are formed in CGP instead.

It should be noted that the saturating add/sub intrinsics are not intended primarily as vector intrinsics -- even though that's the case where they will most typically select to hardware instructions rather than be expanded in DAG. The intrinsic representation is simply generally useful for optimization.

Yes. I was going off the prior art for adding sadd_sat and ssub_sat to instcombine. And from the cases this patch is matching, where we are otherwise extending to a higher type, the intrinsic seems to produce equal or better code in most of the cases I've tried now. So this wasn't just for vectorisation, although it does make things a lot simpler there. (On an arm specific note, we have a scalar qadd instruction that can be used, if we can sort out the "q" flag otherwise being visible from C).

If the canonical form for one of these signed saturating adds/subs wasn't an intrinsic, what would it be? Going into a higher type is awkward for us because the i64 add is not legal, and so doesn't look like the kind of instruction that should be vectorised, plus in ISel we'd have to catch the lowering fairly early and do something special.

Yes. I was going off the prior art for adding sadd_sat and ssub_sat to instcombine. And from the cases this patch is matching, where we are otherwise extending to a higher type, the intrinsic seems to produce equal or better code in most of the cases I've tried now. So this wasn't just for vectorisation, although it does make things a lot simpler there. (On an arm specific note, we have a scalar qadd instruction that can be used, if we can sort out the "q" flag otherwise being visible from C).

If the canonical form for one of these signed saturating adds/subs wasn't an intrinsic, what would it be? Going into a higher type is awkward for us because the i64 add is not legal, and so doesn't look like the kind of instruction that should be vectorised, plus in ISel we'd have to catch the lowering fairly early and do something special.

I haven't looked at the patch in detail, but as author of at least part of the prior art cited here, I agree with the direction*. I also participated in some of the vector idioms discussions from a few years ago. There's overlap with the vector idiom problems, but as noted, these are generic (scalar too) math ops, so it's not exactly the same. We invested significantly in IR analysis and codegen for the math intrinsics, so that may have changed the thinking. I don't remember the sequence of events or if there was a dedicated llvm-dev thread for this, but the general idea is that if we have a generic intrinsic for the math and can easily invert the transform in the backend for targets/types that are not supported, try to canonicalize to the intrinsic.

https://bugs.llvm.org/show_bug.cgi?id=43580 may show another example where transforming to an intrinsic would help. cc @craig.topper @xbolva00
*We should have codegen tests in place for multiple targets/types, and it appears that is added with D68643.

I haven't looked at the patch in detail, but as author of at least part of the prior art cited here, I agree with the direction*. I also participated in some of the vector idioms discussions from a few years ago. There's overlap with the vector idiom problems, but as noted, these are generic (scalar too) math ops, so it's not exactly the same. We invested significantly in IR analysis and codegen for the math intrinsics, so that may have changed the thinking. I don't remember the sequence of events or if there was a dedicated llvm-dev thread for this, but the general idea is that if we have a generic intrinsic for the math and can easily invert the transform in the backend for targets/types that are not supported, try to canonicalize to the intrinsic.

Don't get me wrong. I'm not against the direction. I just want to make sure we have a general agreement with the rest of the community on where and how to draw the line ---- and get it documented (and the document to be kept up-to-date). Then, we can go back to the idiom list we created and determine which ones should be canonicalized, and which ones should go to better pattern matchers --- reflect that into the document. That'll make it easier for other interested people to actually work on those, w/o having to make a point on individual cases. That's the main purpose of raising this question.

I just want to make sure we have a general agreement with the rest of the community..

That's easy enough for this patch. (Well, IT issues aside..)

.. on where and how to draw the line ..

Not sure about in general though!

  • and get it documented (and the document to be kept up-to-date).

Any ideas what it would look like? And where this would live? At some point when we get into enough of the details the code becomes the best documentation, but I'm guessing you will disagree with that...

.. on where and how to draw the line ..

Not sure about in general though!

  • and get it documented (and the document to be kept up-to-date).

Any ideas what it would look like? And where this would live? At some point when we get into enough of the details the code becomes the best documentation, but I'm guessing you will disagree with that...

I'm not sure this can really be documented and enforced.

I'm in favor of treating signed saturation as canonical. The issue in delaying detection of such cases to instruction selection is the volatility of the IR: there is no guarantee that the IR will remain in the same form (expected by isel) from one day to the next. For example, some optimization may decide to just promote the operations to the wider type and only do the extension/truncate once, depending on how many saturating operations may be near one another. Handling this variability in isel is just not feasible.

fhahn added a subscriber: fhahn.Oct 10 2019, 11:30 AM

I'm in favor of treating signed saturation as canonical. The issue in delaying detection of such cases to instruction selection is the volatility of the IR: there is no guarantee that the IR will remain in the same form (expected by isel) from one day to the next. For example, some optimization may decide to just promote the operations to the wider type and only do the extension/truncate once, depending on how many saturating operations may be near one another. Handling this variability in isel is just not feasible.

I don't want to hijack this review. Let me have the rest of the discussion in the form of RFC on llvm-dev. Vector idiom discussions resulted in ~20 idioms. It would be nice if we can come up with a basic guideline on how to think and how to make a case for the new canonical form. For example, TI said they are interested in saturating mul. If saturating add/sub have a canonical form, I don't see why saturating mul should not.

nikic added a comment.Oct 12 2019, 2:09 AM

Generally looks good to me, I'm only wondering whether the trunc is the right place to start the match. Starting from the min/max we could match a larger set of patterns, in particular those where the result of the saturation is still extended to a larger type -- for example doing a 16-bit saturating add but continuing with a 32-bit result.

llvm/include/llvm/IR/PatternMatch.h
663 ↗(On Diff #223875)

nit: ///

llvm/lib/Transforms/InstCombine/InstCombineCasts.cpp
729 ↗(On Diff #223875)

Rather than exact type equality, we could require that the original type is <= the trunc type and sext to the trunc type. This would allow also matching a saturating add between 16-bit and 8-bit number, for example. Not sure how practically relevant that would be though.

At some point when we get into enough of the details the code becomes the best documentation, but I'm guessing you will disagree with that...

First up, sorry about this line. I was in a bit of a mood over emails not working and someone distracted me, I pressed submit before I had properly constructed what I really wanted to say. Sorry if it came across as accusatory or passive aggressive. I was really wanting to say that; strong specifications, whilst they can be very useful, can also be stifling to innovation and bog down forward progress. We should try to keep a balance between writing things down and getting things done :)

Secondly, I had to revert D68643 because it wasn't sign extending the values as it should. Simple cases were working, but more complex examples were not. This will mostly effect illegal types (like i4's or i8/i16 on arm), but will make the results look less impressive. I will put up another patch together showing the correct results, which will contain more extends in places. I don't think it changes the decision of whether to use the intrinsics, just moves a few cases from looking like "an improvement", to just being "the same". And still better for many cases like vectorisation.

I don't want to hijack this review. Let me have the rest of the discussion in the form of RFC on llvm-dev. Vector idiom discussions resulted in ~20 idioms. It would be nice if we can come up with a basic guideline on how to think and how to make a case for the new canonical form. For example, TI said they are interested in saturating mul. If saturating add/sub have a canonical form, I don't see why saturating mul should not.

I was going to say yeah, sounds great to me, let me look into adding it.. But I took another look at the MVE spec and we apparently don't have that instruction. There are some that do multiplying + doubling and saturate the high half. Apparently that's useful to someone! I don't believe it would be a detriment to have a saturating mul though, so sounds good to me if someone wants to add it.

Do you happen to know if that discussion you spoke of was written down? It sounds very interesting. One thing I would like to try is to kick llvm's use of reductions up to 11, and it might provide useful insights on the best way to make that work.

I'm only wondering whether the trunc is the right place to start the match. Starting from the min/max we could match a larger set of patterns, in particular those where the result of the saturation is still extended to a larger type -- for example doing a 16-bit saturating add but continuing with a 32-bit result.

Yeah I was wondering if that would be better. So long as it doesn't create strange types, I think it sounds like a good idea. I'll give it a try and let you know how it looks.

dmgreen updated this revision to Diff 224818.Oct 14 2019, 1:22 AM

Update to start from a select, not the trunc.

lebedev.ri added inline comments.Oct 14 2019, 3:09 AM
llvm/lib/Transforms/InstCombine/InstCombineSelect.cpp
2032–2035

Can you tell which test is not folded because of this check?

2043–2046

This is ugly.
There should be a way to do that by passing an old type and a new bitwidth.

2048–2057

This should be closer to something like

Value *A, *B;
if(!match(AddSub, m_BinOp(m_SExt(m_Value(A)), m_SExt(m_Value(B)))))
  return nullptr'

Function *F;
if (AddSub.getOpcode() == Instruction::Add)
  F = Intrinsic::getDeclaration(MinMax1.getModule(), Intrinsic::sadd_sat,
                                NewTy);
else if (AddSub.getOpcode() == Instruction::Sub)
  F = Intrinsic::getDeclaration(MinMax1.getModule(), Intrinsic::ssub_sat,
                                NewTy);
else
  return nullptr;
2058–2060

I don't understand why this test is here.
You know you matched sext on the way to A and B.

dmgreen marked 6 inline comments as done.Oct 14 2019, 7:07 AM
dmgreen added inline comments.
llvm/lib/Transforms/InstCombine/InstCombineSelect.cpp
2032–2035

Ah. That was meant to be sadd_sat31, but then I re-added back in some extra checks. I'll change it to a sadd_sat4 which should test this.

I presume it's a good idea to prevent the creation i4's?

2043–2046

Do you mean the whole "if (Ty->isVectorTy()).. "
or just the "VectorType::get(..)" part?

The second one looks simple, by using getElementCount. Let me know if you meant the first.

2058–2060

I originally left this out of the new patch, but think it is still needed. We could be SExtending from something larger than the NewBitWidth, and if it does contain higher bits the results of the add might be different to what we expect.

Something like https://rise4fun.com/Alive/WZvW, if I have that correct (I don't know if there is a way to say "sadd_sat" exactly. Let me know if you know of a way, I thought I'd seen something like that before but can't find it now. The idea is to try and replicate what that does in the i9 part).

dmgreen updated this revision to Diff 224858.Oct 14 2019, 7:18 AM
dmgreen marked 2 inline comments as done.

Address comments.

lebedev.ri added inline comments.Oct 14 2019, 7:49 AM
llvm/lib/Transforms/InstCombine/InstCombineSelect.cpp
2031
/// In what bitwidth can this be treated as saturating arithmetics?
2032–2035

Eh, good point.
Not sure honestly, i don't remember ever seeing that check in any of the transforms recently-ish added.

2043–2046

The entirety of the block i highlighted.

2048–2057

Now that i think about it, the only really differing part is the new opcode, so

ZZZ *NewOpcode;
if (AddSub->getOpcode() == Instruction::Add)
  NewOpcode = Intrinsic::sadd_sat;
else if (AddSub->getOpcode() == Instruction::Sub)
  NewOpcode = Intrinsic::ssub_sat;
else
  return nullptr;

auto* F = Intrinsic::getDeclaration(MinMax1.getModule(), Intrinsic::ssub_sat, NewTy);

even.

2058–2060

Please move to right after if(!match(AddSub, m_BinOp(m_SExt(.
I guess i see what you mean, please ensure there is a test for that.

2061

This needs more meaningful comment.

dmgreen updated this revision to Diff 224899.Oct 14 2019, 1:45 PM
dmgreen marked 5 inline comments as done.

Added a changeBitWidth to Type, and other cleanup.

lebedev.ri edited the summary of this revision. (Show Details)Oct 16 2019, 12:20 PM

Even though we don't care about trunc, in all the tests there is a trunc in the end.
Can you either drop it from tests, or at least add a test without trunc?

Looks good to me in principle.

I'm pretty sure there are two more patterns that can be handled - you are using extension
because in original code [signed] overflow is undefined.
But that is only the case in IR if nsw is specified.
You can write it without UB, e.g.:

----------------------------------------
  %agg = uadd_overflow {i4, i1} %a, %b
  %add = extractvalue {i4, i1} %agg, 0
  %ov = extractvalue {i4, i1} %agg, 1
  %res = select %ov, i4 -1, %add
  ret i4 %res
=>
  %res = uadd_sat i4 %a, %b
  ret i4 %res
  %agg = uadd_overflow {i4, i1} %a, %b
  %ov = extractvalue {i4, i1} %agg, 1
  %add = extractvalue {i4, i1} %agg, 0

Done: 1
Optimization is correct!

yet we don't handle it:
https://godbolt.org/z/6VAnoL

That can also be written without @llvm.uadd.with.overflow(), and the same for signed.

llvm/include/llvm/IR/Type.h
388 ↗(On Diff #224899)

Can you use it somewhere in existing code and land this change separately?

llvm/lib/Transforms/InstCombine/InstCombineSelect.cpp
2032–2035

@spatel @nikic thoughts? I'm almost thinking this should not be here.

nikic added a comment.Oct 16 2019, 1:23 PM

@lebedev.ri I believe that pattern is principally handled in canonicalizeSaturatedAdd(), it just depends on the way in which the overflow check is expressed. Apparently it checks for the ~X u< Y ? -1 : X + Y pattern, but not for (X + Y) < X ? -1 : X + Y, which is our canonical non-constant add overflow check form, I think.

llvm/lib/Transforms/InstCombine/InstCombineSelect.cpp
2032–2035

I think this check makes sense in this case, because we are creating a type from integer constants. E.g. if those happen to be -256 and 255, we'd create a sadd.sat.i9 call, which will introduce the use of an "unusual" type that was not present in the source program. I think that's something we should generally avoid, as it will likely result in an unfavorable lowering in the backend.

dmgreen marked 3 inline comments as done.Oct 17 2019, 4:08 PM

Thanks for the alive proofs. I had written a patch to do the same thing (for sat's at least), but it looks like you already have something similar that handles overflows too.

@lebedev.ri I believe that pattern is principally handled in canonicalizeSaturatedAdd(), it just depends on the way in which the overflow check is expressed. Apparently it checks for the ~X u< Y ? -1 : X + Y pattern, but not for (X + Y) < X ? -1 : X + Y, which is our canonical non-constant add overflow check form, I think.

I will look into this. Thanks for the pointers!

llvm/include/llvm/IR/Type.h
388 ↗(On Diff #224899)

D69139, as you already saw.

llvm/lib/Transforms/InstCombine/InstCombineSelect.cpp
2032–2035

InstCombine used to be more strict about changing to illegal types before shouldChangeType was relaxed to at least include common types. I've not ran performance numbers either way over whether it's better to be less strict about this. It likely doesn't come up a lot in this case.

The checks we currently have here will only allow generate sats if there was an existing value of that type, from the sext (I think). sext from a larger type will not be valid to transform, and if both inputs are smaller the min/max will be superfluous (so we just use a standard add). So at least one of the inputs needs to be the same size as the final sat.

Whether we should turn a sext from an illegal type into a sadd_sat with an illegal type is the question then. I'd still go with no, but that's mostly because it feel's simpler (and it's likely does not come up very often).

dmgreen updated this revision to Diff 225542.Oct 17 2019, 4:12 PM
dmgreen marked an inline comment as done.

Mostly rebased out D69139, but also added a notrunc test.

lebedev.ri accepted this revision.Mon, Oct 21, 3:08 AM

Okay, LG.
@nikic ?

This revision is now accepted and ready to land.Mon, Oct 21, 3:08 AM
nikic accepted this revision.Mon, Oct 21, 11:40 AM

LG

llvm/lib/Transforms/InstCombine/InstCombineSelect.cpp
2053

I'd use Ty->getScalarSizeInBits() rather than Ty->getScalarType()->getIntegerBitWidth() here and above. I believe that's idiomatic even if we're dealing specifically in integers.

lebedev.ri marked an inline comment as done.Mon, Oct 21, 11:54 AM
lebedev.ri added inline comments.
llvm/lib/Transforms/InstCombine/InstCombineSelect.cpp
2053

Oh, thanks for spotting this, i thought i posted about this, but i guess it was in an inline comment that i deleted before posting.

This revision was automatically updated to reflect the committed changes.
dmgreen marked an inline comment as done.

LNT reports 7% slowdown for SingleSource/UnitTests/Vectorizer/gcc-loops. Hard to say if noise, since this was the last LNT build.

https://lnt.llvm.org/db_default/v4/nts/130142

Hello, Looks like noise to me: http://lnt.llvm.org/db_default/v4/nts/graph?plot.0=1344.1605916.3&highlight_run=130142

I don't think that benchmark contains anything that would hit this code. If you compare gcc-loops in these two runs, the hashes of the binaries are the same:
https://lnt.llvm.org/api/db_default/v4/nts/runs/130141
https://lnt.llvm.org/api/db_default/v4/nts/runs/130142
Why the first was repeatedly 1.3 and the later repeatedly 1.4, i'm not sure, but it looks like it's been bouncing around like that for a while.

(Also I feel that the hashes not changing should really appear in the run results somewhere! It's useful to filter out the noise)

Thanks!

Yes, it would be great to show that binaries are same.