Page MenuHomePhabricator

[SCEV] Add [zs]ext{C,+,x} -> (D + [zs]ext{C-D,+,x})<nuw><nsw> transform
ClosedPublic

Authored by rtereshin on Jul 2 2018, 2:53 PM.

Details

Summary

as well as [zs]ext(C + x + ...) -> (D + [zs]ext(C-D + x + ...))<nuw><nsw>

if the top level addition in (D + (C-D + x * n)) could be proven to
not wrap, where the choice of D also maximizes the number of trailing
zeroes of (C-D + x * n), ensuring homogeneous behaviour of the
transformation and better canonicalization of such AddRec's

(indeed, there are 2^(2w) different expressions in B1 + ext(B2 + Y) form for
the same Y, but only 2^(2w - k) different expressions in the resulting
B3 + ext((B4 * 2^k) + Y) form, where w is the bit width of the integral type)

The AddExpr version of the transformation enables better canonicalization
of expressions like

1 + zext(5 + 20 * %x + 24 * %y)  and
    zext(6 + 20 * %x + 24 * %y)

which get both transformed to

2 + zext(4 + 20 * %x + 24 * %y)

This pattern is common in address arithmetics and the transformation
makes it easier for passes like LoadStoreVectorizer to prove that 2 or
more memory accesses are consecutive and optimize (vectorize) them.

I found this change similar to a number of other changes to Scalar Evolution, namely:

commit 63c52aea76b530d155ec6913d5c3bbe1ecd82ad8
Author: Sanjoy Das <sanjoy@playingwithpointers.com>
Date:   Thu Oct 22 19:57:38 2015 +0000

    [SCEV] Commute zero extends through <nuw> additions

    git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@251052 91177308-0d34-0410-b5e6-96231b3b80d8
commit 3edd5bf90828613bacfdc2ce047d3776363123e5
Author: Justin Lebar <jlebar@google.com>
Date:   Thu Jun 14 17:13:48 2018 +0000

    [SCEV] Simplify zext/trunc idiom that appears when handling bitmasks.

    Summary:
    Specifically, we transform

      zext(2^K * (trunc X to iN)) to iM ->
      2^K * (zext(trunc X to i{N-K}) to iM)<nuw>

    This is helpful because pulling the 2^K out of the zext allows further
    optimizations.

    Reviewers: sanjoy

    Subscribers: hiraditya, llvm-commits, timshen

    Differential Revision: https://reviews.llvm.org/D48158

and the most relevant

commit 45788be6e2603ecfc149f43df1a6d5e04c5734d8
Author: Michael Zolotukhin <mzolotukhin@apple.com>
Date:   Sat May 24 08:09:57 2014 +0000

    Implement sext(C1 + C2*X) --> sext(C1) + sext(C2*X) and
    sext{C1,+,C2} --> sext(C1) + sext{0,+,C2} transformation in Scalar
    Evolution.

    That helps SLP-vectorizer to recognize consecutive loads/stores.

    <rdar://problem/14860614>

    git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@209568 91177308-0d34-0410-b5e6-96231b3b80d8

This patch generalizes the latter one by relaxing the requirements the following way:

  1. C2 doesn't have to be a power of 2, it enough if it's divisible by 2 a sufficient number of times;
  2. C1 doesn't have to be less than C2, instead of extracting the entire C1 we can split it into 2 terms: (00...0XXX + YY...Y000), keep the second one that may cause wrapping within the extension operator, and move the first one that doesn't affect wrapping out of the extension operator, enabling further simplifications;
  3. C1 and C2 don't have to be positive, splitting C1 like shown above produces a sum that is guaranteed to not wrap, signed or unsigned;
  4. in AddExpr case there could be more than 2 terms, and in case of AddExpr the 2nd and following terms and in case of AddRecExpr the Step component don't have to be in the C2*X form or constant (respectively), they just need to have enough trailing zeros, which in turn could be guaranteed by means other than arithmetics, e.g. by a pointer alignment;
  5. the extension operator doesn't have to be a sext, the same transformation works and profitable for zext's as well.

Apparently, optimizations like SLPVectorizer currently fail to
vectorize even rather trivial cases like the following:

double bar(double *a, unsigned n) {

double x = 0.0;
double y = 0.0;
for (unsigned i = 0; i < n; i += 2) {
  x += a[i];
  y += a[i + 1];
}
return x * y;

}

If compiled with clang -std=c11 -Wpedantic -Wall -O3 main.c -S -o - -emit-llvm
(!{!"clang version 7.0.0 (trunk 337339) (llvm/trunk 337344)"})

it produces scalar code with the loop not unrolled with the unsigned n and
i (like shown above), but vectorized and unrolled loop with signed n and
i (follow https://godbolt.org/g/nq9xF8 to play with it).

With the changes made in this commit the unsigned version will be
vectorized (though not unrolled for unclear reasons).

Diff Detail

Repository
rL LLVM

Event Timeline

rtereshin created this revision.Jul 2 2018, 2:53 PM
rtereshin added inline comments.Jul 2 2018, 3:03 PM
lib/Analysis/ScalarEvolution.cpp
1813 ↗(On Diff #153796)

Another thing to discuss here is the fact that SCEV appears to be relying on value range analysis implemented via ConstantRange instead of KnownBits. It appears to me that we could achieve better results if we used both simultaneously updating them properly. See the example above justifying that.

Do you think it's worth bringing up at dev mailing list level?

rtereshin added inline comments.Jul 2 2018, 3:49 PM
lib/Analysis/ScalarEvolution.cpp
1813 ↗(On Diff #153796)

Now's the adventurous bit: KnownBits is able to prove that C1 + (C2 * 2^n * X) doesn't wrap if C1 < 2^n precisely because KnownBits operates over the arithmetic base of 2. If KnownBits operated over base of 3, for example, we could use it to prove that C1 + (C2 * 3^n * X) doesn't wrap (for instance, u * 3 + 1. Indeed, if bits of 3 are all unknown, KnownBits<base 3> of (u * 3) is XXXX XXX0 and therefore u * 3 + C doesn't wrap for any C <- {0, 1, 2}). I suspect that it could be proven that there is a basis (in linear algebra sense) in the system of KnownBits' that is sufficient: KnownBits over prime numbers.

So let's say for every SCEV expression we cache not just ConstantRange, but every non-trivial KnownBits<B>, where B (the base) <- {p1, p2, ..., pK}, p<i> is i-th prime number, K is some reasonable limit, and "non-trivial" means "not all bits (or rather digits) are unknown", and we use that information to effectively restore <nuw>/<nsw> flags where needed.

Could you use ScalarEvolution::GetMinTrailingZeros instead of calling computeKnownBits directly?

Could you use ScalarEvolution::GetMinTrailingZeros instead of calling computeKnownBits directly?

I don't think so. I need to know the maximum unsigned number I can safely subtract from an expression (w/o wrapping). Number of trailing zeros is of no use for this, I think.
GetMinTrailingOnes, if it existed, wouldn't help much either: let's say, the expression is x * 8 + 6. The number of trailing ones here is 0, though we can safely subtract any number from 0 to 6 (inclusive) (KnownBits are XXXX X110).

To further entertain the idea, let's notice that the upper limit is not always the constant part of the expression. If it's x * 4 + 6 we can only subtract 0, 1, or 2.
Of course, we could use GetMinTrailingZeros(x * 4) = 2, turn this into a 0000 0011 mask, apply the mask to the constant part (0000 0110), get 10 (2) as the maximum safe subtrahend.

But what if there is more than 2 terms? For instance, the expression is 6 + 20 * %x + 24 * %y. Do I need to rebuild an add from 20 * %x and 24 * %y just to apply GetMinTrailingZeros to the result?

Or I could extract the part of the GetMinTrailingZeros that handles add as a separate method so it could be used here. Do you think any of it is a better solution that applying computeKnownBits directly?

rtereshin added inline comments.Jul 3 2018, 9:41 AM
lib/Analysis/ScalarEvolution.cpp
1813 ↗(On Diff #153796)

Ah, I just realized that due to the unfortunate fact that (2^4 - 1) is divisible by 3 and 5, KnownBits over base 3 won't allow us to prove that 3*u + 1 doesn't wrap, as it very well may. It will allow us to prove though that 3*u + 1, 3*u + 2, and 3*u + 3 are consecutive, but it's probably not as useful as if we could start from 0. Same for base 5.

mkazantsev added inline comments.Jul 9 2018, 10:23 PM
lib/Analysis/ScalarEvolution.cpp
1816 ↗(On Diff #153796)

I don't understand why we need this. computeKnownBits is used to deduce ranges of SCEVUnknown. All other SCEV nodes are supposed to propagate range information (e.g. range of sum is a range from sum of min to sum of max, and so on). Thus, in theory, we should be able to identify the range of any SCEV correctly, unless we have some missing logic in range calculation.

What is OpV in the example you're trying to improve, and why SCEV was unable to deduce its range via getUnsignedRange(getSCEV(OpV))?

mkazantsev added inline comments.Jul 9 2018, 10:25 PM
test/Analysis/ScalarEvolution/no-wrap-add-exprs.ll
150 ↗(On Diff #153796)

Its weird. Why signed and unsigned ranges are different?

rtereshin added inline comments.Jul 9 2018, 11:12 PM
lib/Analysis/ScalarEvolution.cpp
1816 ↗(On Diff #153796)

What is OpV in the example you're trying to improve

One of the examples is given in the comment nearby:

// (zext (add (shl X, C1), C2)), for instance, (zext (5 + (4 * X))).
// ConstantRange is unable to prove that 1 + (4 + 4 * X) doesn't wrap in
// such cases:
//
// | Expression |     ConstantRange      |       KnownBits       |
// |------------|------------------------|-----------------------|
// | 4 * X      | [L: 0, U: 253)         | XXXX XX00             |
// |            |   => Min: 0, Max: 252  |   => Min: 0, Max: 252 |
// |            |                        |                       |
// | 4 * X + 4  | [L: 4, U: 1) (wrapped) | YYYY YY00             |
// |            |   => Min: 0, Max: 255  |   => Min: 0, Max: 252 |

see also lldb session running a similar example, also present in test/Analysis/ScalarEvolution/no-wrap-add-exprs.ll updated in this patch:

   1814	      if (OpV) {
   1815	        const DataLayout &DL = getDataLayout();
   1816	        KnownBits Known = computeKnownBits(OpV, DL, 0, &AC, nullptr, &DT);
-> 1817	        MinValue = Known.One.ugt(MinValue) ? Known.One : MinValue;
   1818	      }
   1819	      APInt C = SC->getAPInt();
   1820	      APInt D = MinValue.ugt(C) ? C : MinValue;
Target 0: (opt) stopped.
(lldb) p OpV->dump()
  %t1 = add i8 %t0, 5
(lldb) p SA->dump()
(5 + (4 * %x))
(lldb) p MinValue
(llvm::APInt) $1 = {
  U = {
    VAL = 0
    pVal = 0x0000000000000000
  }
  BitWidth = 8
}
(lldb) p Known.One.dump()
APInt(8b, 1u 1s)
(lldb) p getUnsignedRange(SA)
(llvm::ConstantRange) $2 = {
  Lower = {
    U = {
      VAL = 5
      pVal = 0x0000000000000005
    }
    BitWidth = 8
  }
  Upper = {
    U = {
      VAL = 2
      pVal = 0x0000000000000002
    }
    BitWidth = 8
  }
}

and why SCEV was unable to deduce its range via getUnsignedRange(getSCEV(OpV))?

As you mentioned, the range information is calculated using the

e.g. range of sum is a range from sum of min to sum of max, and so on

principle. ConstantRange keeps track of min/max boundaries, but it completely loses any periodic information, like "the range contains only values divisible by 4". KnownBits behavior is exact opposite: its imprecise when it comes to boundaries, but it keeps track of the periodic information.

For instance, the only thing that is known about 4 * x is that the 2 least significant bits of the value are 0s. From ConstantRange perspective it only means that the value doesn't exceed 2^32 - 4 if treated as unsigned i32. It's completely unaware of the fact that 4 * x could not be 7, for instance. If we shift the range by adding 5 (4 * x + 5) min/max recomputation of the range leads to the wrapped around range [5, 2), that gives us no useful information about the minimum and maximum values (minimum is 0, maximum is 2^32 - 1). While from known bits we know that 4 * x + 5 looks like XXX...XX01, therefore the minimum value is 1, and the maximum value is 2^32 - 3.

Ranges and KnownBits are complementary to each other, neither is more precise than the other in all cases. If we want a value range analysis with good precision, we need to maintain and update both simultaneously.

rtereshin added inline comments.Jul 9 2018, 11:42 PM
test/Analysis/ScalarEvolution/no-wrap-add-exprs.ll
150 ↗(On Diff #153796)

That's a good question. One thing I know is that the issue is orthogonal to this patch and exists on trunk:

%p1.zext = zext i8 %p1 to i16
-->  (zext i8 (8 + (4 * %x)) to i16) U: [0,253) S: [0,256)

(this is w/o this patch applied)

Perhaps unsigned range takes some knownbits-like information into account, while signed one doesn't.

rtereshin added inline comments.Jul 9 2018, 11:53 PM
test/Analysis/ScalarEvolution/no-wrap-add-exprs.ll
150 ↗(On Diff #153796)
rtereshin added inline comments.Jul 10 2018, 12:11 AM
lib/Analysis/ScalarEvolution.cpp
1816 ↗(On Diff #153796)

@mkazantsev I think I see what's the source of the confusion: apparently, the current implementation tries to utilize knownbits-like information in a limited form of "number of trailing zeros", which is computed for Add the following way:

if (const SCEVAddExpr *A = dyn_cast<SCEVAddExpr>(S)) {
  // The result is the min of all operands results.
  uint32_t MinOpRes = GetMinTrailingZeros(A->getOperand(0));
  for (unsigned i = 1, e = A->getNumOperands(); MinOpRes && i != e; ++i)
    MinOpRes = std::min(MinOpRes, GetMinTrailingZeros(A->getOperand(i)));
  return MinOpRes;
}

https://github.com/llvm-mirror/llvm/blob/650cfa6dc060acb5b4c9571d454ec2b990aad648/lib/Analysis/ScalarEvolution.cpp#L5375-L5381

So it does work for expressions like 4 + 4 * x (well, sometimes, somehow that kind of information is there for unsigned ranges, but not for signed ranges), and that makes my comment inaccurate. I will change it to 5 + 4 * x example.

For 5 + 4 * x it doesn't work, of course, as the number of trailing zeroes is 0 (5 + 4 * x ~= XXX...XX01).

rtereshin updated this revision to Diff 154914.Jul 10 2018, 6:22 PM

Updated the comment from a misleading 4 + 4 * x example to a correct 5 + 4 * x one.

I think if we're going to do this, we need to implement it on top of a SCEV-based known-bits implementation; introducing a separate getZeroExtendExprForValue API is going to lead to weird results if SCEV creates a zero-extend expression for some other reason.

Whether we should do this in general, I'm not really sure. I mean, yes, I can see how this particular form is a bit more convenient for the load-store vectorizer, but it doesn't seem very general; it seems more intuitive to canonicalize towards reducing the number of AddExprs. But maybe pulling as much information as possible outside of the zext is generally useful enough to make this worthwhile?

Do you also plan to implement a similar transform for AddRecs? (e.g. (zext i32 {1,+,2}<%while.body> to i64)).

I think if we're going to do this, we need to implement it on top of a SCEV-based known-bits implementation; introducing a separate getZeroExtendExprForValue API is going to lead to weird results if SCEV creates a zero-extend expression for some other reason.

I agree, that would be a more generic and homogeneous solution. Using (ConstantRange, KnownBits) pair instead of (ConstantRange, minTrailingZeros) (let alone only one component of the latter pair) across Scalar Evolution consistently may also benefit the framework in a number of other ways. It's a more intrusive change though. For now I think I could try to go with the number of trailing zeros approach despite the loss in generality if you or community feel strongly against known bits used the way the are used now in this patch.

Whether we should do this in general, I'm not really sure. I mean, yes, I can see how this particular form is a bit more convenient for the load-store vectorizer, but it doesn't seem very general; it seems more intuitive to canonicalize towards reducing the number of AddExprs. But maybe pulling as much information as possible outside of the zext is generally useful enough to make this worthwhile?

I think this transformation reduces the number of possible operands of a zext, so it brings some of the expressions in C1 + zext(C2 + X) form to the same shape - often C3 + zext(C4 * 2^k + X) - which is canonicalization (if some of the constants are missing let's say they are just zeroes). There is 2^(2w) different pairs of constants (C1, C2), and only 2^(2w - k) different pair of (C3, C4 ^ 2^k), where w is the bit width of the type.

Do you also plan to implement a similar transform for AddRecs? (e.g. (zext i32 {1,+,2}<%while.body> to i64)).

I suppose I should, what would you suggest?

rtereshin retitled this revision from [SCEV] Add zext(C + x + ...) -> D + zext(C-D + x + ...)<nuw> transform to [SCEV] Add [zs]ext{C,+,x} -> (D + [zs]ext{C-D,+,x})<nuw><nsw> transform.
rtereshin edited the summary of this revision. (Show Details)

I've moved away from using KnownBits, extended the proposed transformation to AddRecs, generalized it for signed extensions as well, and unified it all with pre-existing sext-only transformations that handle a strict subset of cases.

There are 2 separate commits here planned, first non-intrusively adds zext(C + x + ...) -> (D + zext(C-D + x + ...))<nuw><nsw> transformation only (in it's no-KnownBits / no-API-changes version that could be seen in this patch) along with the tests from the initial version of this patch (mostly LoadStoreVectorizer-related), while the second commit brings the rest (as well as adds SLPVectorizer-targeting tests).

Hopefully this is better now.

rtereshin added a subscriber: bogner.

Hi,

Thanks for working on this! From the description the approach looks correct and promising. I glanced through the patch, and it looked good, but if you don't mind I'd like to have another look.

Thanks,
Michael

lib/Analysis/ScalarEvolution.cpp
1566–1568 ↗(On Diff #156116)

Could you please also add a description for FullExpr? It might be helpful to add even more examples here and describe the intended use (e.g. ConstantTerm is Start and FullExpr is Step of an AddRec expression).

1585–1588 ↗(On Diff #156116)

Just checking my understanding: we're basically finding the largest common denominator here, which is also a power of 2, right?

rtereshin added inline comments.Jul 19 2018, 3:48 PM
lib/Analysis/ScalarEvolution.cpp
1566–1568 ↗(On Diff #156116)

Thanks for looking into this!

Could you please also add a description for FullExpr?

Sure, will do.

It might be helpful to add even more examples here and describe the intended use (e.g. ConstantTerm is Start and FullExpr is Step of an AddRec expression).

The next overload is the one that handles AddRec with parameter names being ConstantStart and Step. Do you think the names are self-explanatory or I need to elaborate in a comment?

As for AddExpr-version here I'm going to elaborate what FullExpr is and hopefully that will be clear enough.

1585–1588 ↗(On Diff #156116)

we're basically finding the largest common denominator here, which is also a power of 2, right?

This

uint32_t TZ = BitWidth;
for (unsigned I = 1, E = FullExpr->getNumOperands(); I < E && TZ; ++I)
  TZ = std::min(TZ, SE.GetMinTrailingZeros(FullExpr->getOperand(I)))

piece effectively does exactly that, yes. Another way to look at it is to say that we have an AddExpr that looks like (C + x + y + ...), where C is a constant and x, y, ... are arbitrary SCEVs, and we're computing the minimum number of trailing zeroes guaranteed of the sum w/o the constant term: (x + y + ...). If, for example, those terms look like follows:

        i
XXXX...X000
YYYY...YY00
    ...
ZZZZ...0000

then the rightmost non-guaranteed zero bit (a potential one at i-th position above) can change the bits of the sum to the left, but it can not possibly change the bits to the right. So we can compute the number of trailing zeroes by taking a minimum between the numbers of trailing zeroes of the terms.

Now let's say that our original sum with the constant is effectively just C + X, where X = x + y + .... Let's say we've got 2 guaranteed trailing zeros for X:

         j
CCCC...CCCC
XXXX...XX00

Any bit of C to the left of j may in the end cause the C + X sum to wrap, but the rightmost 2 bits of C (at positions j and j - 1) do not affect wrapping in any way. If the upper bits cause a wrap, it will be a wrap regardless of the values of the 2 least significant bits of C. If the upper bits do not cause a wrap, it won't be a wrap regardless of the values of the 2 bits on the right (again).

So let's split C to 2 constants like follows:

0000...00CC  = D
CCCC...CC00  = (C - D)

and the whole sum like D + (C - D + X). The second term of this new sum looks like this:

CCCC...CC00
XXXX...XX00
-----------
YYYY...YY00

The sum above (let's call it Y)) may or may not wrap, we don't know, so we need to keep it under a sext/zext. Adding D to that sum though will never wrap, signed or unsigned, if performed on the original bit width or the extended one, because all that that final add does is setting the 2 least significant bits of Y to the bits of D:

YYYY...YY00 = Y
0000...00CC = D
----------- <nuw><nsw>
YYYY...YYCC

Which means we can safely move that D out of the sext or zext and claim that the top-level sum neither sign wraps nor unsigned wraps.

Let's run an example, let's say we're working in i8s and the original expression (zext's or sext's operand) is 21 + 12x + 8y. So it goes like this:

0001 0101  // 21
XXXX XX00  // 12x
YYYY Y000  // 8y

0001 0101  // 21
ZZZZ ZZ00  // 12x + 8y  // true, alternatively one can say that gcd(12, 8) is guaranteed to have 2 zeroes on the right

0000 0001  // D
0001 0100  // 21 - D = 20
ZZZZ ZZ00  // 12x + 8y

0000 0001  // D
WWWW WW00  // 21 - D + 12x + 8y = 20 + 12x + 8y

therefore zext(21 + 12x + 8y) = (1 + zext(20 + 12x + 8y)<nuw><nsw>

mzolotukhin added inline comments.Jul 19 2018, 4:05 PM
lib/Analysis/ScalarEvolution.cpp
1566–1568 ↗(On Diff #156116)

Thanks! I'm fine with whatever way you choose, I'd just like to see some example/description of what FullExpr should be.

For instance, I spent some time thinking that FullExpr would be, e.g. 3 + 4*x + 6*y (which was inspired by your examples) and I couldn't understand how you can ever get TZ not equal to 0. It all made sense in the end, but saying that ConstantStart is 3 and FullExpr is 4*x + 6*y would've saved me some time :)

1585–1588 ↗(On Diff #156116)

Thanks for the great explanation! I think it's worth having it or its shorter version somewhere in comments. And just to be clear: I think that the patch is already very well-commented (thank you for that!), my remarks are just nit-picks.

rtereshin added inline comments.Jul 19 2018, 4:24 PM
lib/Analysis/ScalarEvolution.cpp
1566–1568 ↗(On Diff #156116)

Thing is, you were right, SCEVAddExpr *FullExpr is 3 + 4*x + 6*y. It's the iteration (for (unsigned I = 1,...) that goes from operand 1 instead of operand 0.

I feel like it would be a little trickier to ask clients of this function to provide a reference (a pair of operand iterators, for instance) to 4*x + 6*y.

1585–1588 ↗(On Diff #156116)

You are welcome!

Hm... Do you think it could be better if I just put this as is in the commit message instead? If someone goes curious, they git blame and see a detailed explanation attributed to the exact version of the code that that explanation describes? This way we don't have a really huge comment in code that will most certainly get out of synch with the implementation at some point in the future.

mzolotukhin added inline comments.Jul 19 2018, 4:35 PM
lib/Analysis/ScalarEvolution.cpp
1566–1568 ↗(On Diff #156116)

Right, we start from 1! Anyways, some note explaining what's going on with FullExpr should help here.

1585–1588 ↗(On Diff #156116)

Do you think it could be better if I just put this as is in the commit message instead?

Yeah, that's a good idea.

rtereshin updated this revision to Diff 156411.Jul 19 2018, 5:38 PM

Updated and added comments as requested, renamed FullExpr to WholeAddExpr, and planned to add the following piece to the end of the commit message:

How it all works:

Let say we have an AddExpr that looks like (C + x + y + ...), where C
is a constant and x, y, ... are arbitrary SCEVs. Let's compute the
minimum number of trailing zeroes guaranteed of that sum w/o the
constant term: (x + y + ...). If, for example, those terms look like
follows:

        i
XXXX...X000
YYYY...YY00
   ...
ZZZZ...0000

then the rightmost non-guaranteed-zero bit (a potential one at i-th
position above) can change the bits of the sum to the left (and at
i-th position itself), but it can not possibly change the bits to the
right. So we can compute the number of trailing zeroes by taking a
minimum between the numbers of trailing zeroes of the terms.

Now let's say that our original sum with the constant is effectively
just C + X, where X = x + y + .... Let's also say that we've got 2
guaranteed trailing zeros for X:

        j
CCCC...CCCC
XXXX...XX00  // this is X = (x + y + ...)

Any bit of C to the left of j may in the end cause the C + X sum to
wrap, but the rightmost 2 bits of C (at positions j and j - 1) do not
affect wrapping in any way. If the upper bits cause a wrap, it will be
a wrap regardless of the values of the 2 least significant bits of C.
If the upper bits do not cause a wrap, it won't be a wrap regardless
of the values of the 2 bits on the right (again).

So let's split C to 2 constants like follows:

0000...00CC  = D
CCCC...CC00  = (C - D)

and represent the whole sum as D + (C - D + X). The second term of
this new sum looks like this:

CCCC...CC00
XXXX...XX00
-----------  // let's add them up
YYYY...YY00

The sum above (let's call it Y)) may or may not wrap, we don't know,
so we need to keep it under a sext/zext. Adding D to that sum though
will never wrap, signed or unsigned, if performed on the original bit
width or the extended one, because all that that final add does is
setting the 2 least significant bits of Y to the bits of D:

YYYY...YY00 = Y
0000...00CC = D
-----------  <nuw><nsw>
YYYY...YYCC

Which means we can safely move that D out of the sext or zext and
claim that the top-level sum neither sign wraps nor unsigned wraps.

Let's run an example, let's say we're working in i8's and the original
expression (zext's or sext's operand) is 21 + 12x + 8y. So it goes
like this:

0001 0101  // 21
XXXX XX00  // 12x
YYYY Y000  // 8y

0001 0101  // 21
ZZZZ ZZ00  // 12x + 8y

0000 0001  // D
0001 0100  // 21 - D = 20
ZZZZ ZZ00  // 12x + 8y

0000 0001  // D
WWWW WW00  // 21 - D + 12x + 8y = 20 + 12x + 8y

therefore zext(21 + 12x + 8y) = (1 + zext(20 + 12x + 8y))<nuw><nsw>
mzolotukhin accepted this revision.Jul 19 2018, 5:54 PM

I've looked at the patch one more time more carefully, it looks good to me!

Thanks,
Michael

lib/Analysis/ScalarEvolution.cpp
1570–1579 ↗(On Diff #156411)

Maybe move that comment to the explanation in the commit message too? It's not obvious what ConstantRange has to with the code around (I understand where it comes from, but I have the full context of the patch now - for someone browsing through the code later it won't be clear why we mention ConstantRange here).

This revision is now accepted and ready to land.Jul 19 2018, 5:54 PM
rtereshin added inline comments.Jul 19 2018, 6:29 PM
lib/Analysis/ScalarEvolution.cpp
1570–1579 ↗(On Diff #156411)

Yeah, it felt like out of place for a while now, that's a good suggestion, will do.

Thanks for accepting the patch!

rtereshin updated this revision to Diff 156635.Jul 20 2018, 4:27 PM

Removed an out of place comment (the example comparing ConstantRange and KnownBits)

I think if we're going to do this, we need to implement it on top of a SCEV-based known-bits implementation; introducing a separate getZeroExtendExprForValue API is going to lead to weird results if SCEV creates a zero-extend expression for some other reason.

Whether we should do this in general, I'm not really sure. I mean, yes, I can see how this particular form is a bit more convenient for the load-store vectorizer, but it doesn't seem very general; it seems more intuitive to canonicalize towards reducing the number of AddExprs. But maybe pulling as much information as possible outside of the zext is generally useful enough to make this worthwhile?

Do you also plan to implement a similar transform for AddRecs? (e.g. (zext i32 {1,+,2}<%while.body> to i64)).

Hi Eli,

I have moved away from using KnownBits, so no new API anymore, I have also generalized this for AddRec's and sext's, which effectively generalized pre-existing transforms for sext's and added them anew for zext's.

Do you think this is good to go now?

Thanks,
Roman

efriedma accepted this revision.Jul 24 2018, 11:21 AM

Yes, looks fine.

Yes, looks fine.

Thanks!

This revision was automatically updated to reflect the committed changes.