This is an archive of the discontinued LLVM Phabricator instance.

Differential D48853

[SCEV] Add [zs]ext{C,+,x} -> (D + [zs]ext{C-D,+,x})<nuw><nsw> transform
Closed, Public

Authored by rtereshin on Jul 2 2018, 2:53 PM.

Details

Reviewers

sanjoy
mzolotukhin
volkan
efriedma

Commits

rGed047b018430: [SCEV] Add [zs]ext{C,+,x} -> (D + [zs]ext{C-D,+,x})<nuw><nsw> transform
rG1ba1f9310c26: [SCEV] Add zext(C + x + ...) -> D + zext(C-D + x + ...)<nuw><nsw> transform
rL337943: [SCEV] Add [zs]ext{C,+,x} -> (D + [zs]ext{C-D,+,x})<nuw><nsw> transform
rL337859: [SCEV] Add zext(C + x + ...) -> D + zext(C-D + x + ...)<nuw><nsw> transform

Summary

as well as [zs]ext(C + x + ...) -> (D + [zs]ext(C-D + x + ...))<nuw><nsw>

if the top level addition in (D + (C-D + x * n)) could be proven to
not wrap, where the choice of D also maximizes the number of trailing
zeroes of (C-D + x * n), ensuring homogeneous behaviour of the
transformation and better canonicalization of such AddRecs.

(indeed, there are 2^(2w) different expressions in B1 + ext(B2 + Y) form for
the same Y, but only 2^(2w - k) different expressions in the resulting
B3 + ext((B4 * 2^k) + Y) form, where w is the bit width of the integral type
and k is the number of trailing zeros guaranteed for Y)
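The counting argument above can be checked by brute force at a small width. The following C++ sketch is an illustration only, not code from the patch: `countForms` is a hypothetical helper, and it assumes all constants are taken modulo 2^w, as in the count above. It maps every pair (B1, B2) to (B3, B4 * 2^k) by moving the low k bits of B2 out of the extension and counts the distinct results.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>
#include <utility>

// Hypothetical helper: counts the distinct (outer, inner) constant pairs
// obtained by moving the low K bits of the inner constant B2 out of the
// extension, over all (B1, B2) pairs of width W (arithmetic mod 2^W).
// Assumes the non-constant part Y has at least K known trailing zeros.
std::size_t countForms(unsigned W, unsigned K) {
  const uint32_t Mod = 1u << W;
  const uint32_t LowMask = (1u << K) - 1;
  std::set<std::pair<uint32_t, uint32_t>> Forms;
  for (uint32_t B1 = 0; B1 < Mod; ++B1)
    for (uint32_t B2 = 0; B2 < Mod; ++B2) {
      uint32_t D = B2 & LowMask;     // low bits that cannot affect wrapping
      uint32_t B3 = (B1 + D) % Mod;  // new outer constant
      uint32_t B4Shifted = B2 - D;   // B4 * 2^K: low K bits are zero
      Forms.insert({B3, B4Shifted});
    }
  return Forms.size();
}
```

For w = 4 and k = 2 this yields 2^(2*4 - 2) = 64 distinct forms, compared with the 2^8 = 256 pairs before the rewrite.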

The AddExpr version of the transformation enables better canonicalization
of expressions like

1 + zext(5 + 20 * %x + 24 * %y)  and
    zext(6 + 20 * %x + 24 * %y)

which get both transformed to

2 + zext(4 + 20 * %x + 24 * %y)
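This example can be checked exhaustively. The sketch below is standalone C++, not LLVM code, and `formsAgree` is a hypothetical helper: it models the i8 operands of zext as values masked to 8 bits and performs the outer additions in a wider int. The y coefficient is a parameter, so the check can also show what happens when the non-constant terms lose their common trailing zeros.

```cpp
// Returns true if, for every 8-bit %x and %y, the three expressions
//   1 + zext(5 + 20*x + CoeffY*y)
//       zext(6 + 20*x + CoeffY*y)
//   2 + zext(4 + 20*x + CoeffY*y)
// have the same value. The zext operands are computed modulo 2^8 (i8
// wraps); the outer additions are done in int, after the extension.
bool formsAgree(int CoeffY) {
  for (int X = 0; X < 256; ++X)
    for (int Y = 0; Y < 256; ++Y) {
      int Terms = 20 * X + CoeffY * Y;
      int A = 1 + ((5 + Terms) & 0xFF);
      int B = (6 + Terms) & 0xFF;
      int C = 2 + ((4 + Terms) & 0xFF);
      if (A != B || B != C)
        return false;
    }
  return true;
}
```

With the coefficient 24, all three forms agree; with an odd coefficient such as 23, the guaranteed trailing zeros drop to 0, and 1 + zext(5 + ...) no longer matches zext(6 + ...).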

This pattern is common in address arithmetic, and the transformation
makes it easier for passes like LoadStoreVectorizer to prove that 2 or
more memory accesses are consecutive and optimize (vectorize) them.

I found this change similar to a number of other changes to Scalar Evolution, namely:

commit 63c52aea76b530d155ec6913d5c3bbe1ecd82ad8
Author: Sanjoy Das <sanjoy@playingwithpointers.com>
Date:   Thu Oct 22 19:57:38 2015 +0000

    [SCEV] Commute zero extends through <nuw> additions

    git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@251052 91177308-0d34-0410-b5e6-96231b3b80d8

commit 3edd5bf90828613bacfdc2ce047d3776363123e5
Author: Justin Lebar <jlebar@google.com>
Date:   Thu Jun 14 17:13:48 2018 +0000

    [SCEV] Simplify zext/trunc idiom that appears when handling bitmasks.

    Summary:
    Specifically, we transform

      zext(2^K * (trunc X to iN)) to iM ->
      2^K * (zext(trunc X to i{N-K}) to iM)<nuw>

    This is helpful because pulling the 2^K out of the zext allows further
    optimizations.

    Reviewers: sanjoy

    Subscribers: hiraditya, llvm-commits, timshen

    Differential Revision: https://reviews.llvm.org/D48158

and the most relevant

commit 45788be6e2603ecfc149f43df1a6d5e04c5734d8
Author: Michael Zolotukhin <mzolotukhin@apple.com>
Date:   Sat May 24 08:09:57 2014 +0000

    Implement sext(C1 + C2*X) --> sext(C1) + sext(C2*X) and
    sext{C1,+,C2} --> sext(C1) + sext{0,+,C2} transformation in Scalar
    Evolution.

    That helps SLP-vectorizer to recognize consecutive loads/stores.

    <rdar://problem/14860614>

    git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@209568 91177308-0d34-0410-b5e6-96231b3b80d8

This patch generalizes the latter one by relaxing the requirements in the following ways:

- C2 doesn't have to be a power of 2; it is enough if it's divisible by 2 a sufficient number of times;
- C1 doesn't have to be less than C2: instead of extracting the entire C1, we can split it into 2 terms, (00...0XXX + YY...Y000), keep the second one, which may cause wrapping, within the extension operator, and move the first one, which doesn't affect wrapping, out of the extension operator, enabling further simplifications;
- C1 and C2 don't have to be positive: splitting C1 as shown above produces a sum that is guaranteed not to wrap, signed or unsigned;
- in the AddExpr case there could be more than 2 terms; moreover, the 2nd and following terms (in the AddExpr case) and the Step component (in the AddRecExpr case) don't have to be in the C2*X form or constant, respectively; they just need to have enough trailing zeros, which in turn could be guaranteed by means other than arithmetic, e.g. by pointer alignment;
- the extension operator doesn't have to be a sext: the same transformation works and is profitable for zexts as well.
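The relaxations above can be spot-checked on an AddRec. The sketch below uses a hypothetical helper, not the patch's code: it evaluates an i8 AddRec {Start,+,Step} at every iteration, sign-extends it, and compares the result with D + sext({Start - D,+,Step}), where D is the low TZ bits of Start. It assumes two's complement int and that Step has at least TZ trailing zeros.

```cpp
// Reads the low 8 bits of V as a two's complement i8 value, i.e. the
// sign-extended value of an i8 computed with wrapping arithmetic.
static int sext8(int V) {
  int U = V & 0xFF;
  return U < 128 ? U : U - 256;
}

// Checks, for every iteration N of an i8 AddRec {Start,+,Step}, that
//   sext({Start,+,Step}) == D + sext({Start - D,+,Step}),
// where D keeps only the low TZ bits of Start. Step is assumed to have at
// least TZ trailing zeros; the outer add is done in int.
bool splitHoldsForAddRec(int Start, int Step, unsigned TZ) {
  int D = Start & ((1 << TZ) - 1);  // low TZ bits (two's complement)
  for (int N = 0; N < 256; ++N) {   // i8 values repeat with period 256
    int Lhs = sext8(Start + Step * N);
    int Rhs = D + sext8(Start - D + Step * N);
    if (Lhs != Rhs)
      return false;
  }
  return true;
}
```

For example, sext{-7,+,4} splits into 1 + sext{-8,+,4}, and sext{7,+,4} into 3 + sext{4,+,4}, while asking for three low bits when the step only guarantees two (Start = -3, TZ = 3) breaks the equivalence.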

Apparently, optimizations like SLPVectorizer currently fail to
vectorize even rather trivial cases like the following:

double bar(double *a, unsigned n) {
  double x = 0.0;
  double y = 0.0;
  for (unsigned i = 0; i < n; i += 2) {
    x += a[i];
    y += a[i + 1];
  }
  return x * y;
}

If compiled with clang -std=c11 -Wpedantic -Wall -O3 main.c -S -o - -emit-llvm
(!{!"clang version 7.0.0 (trunk 337339) (llvm/trunk 337344)"}),
it produces scalar code, with the loop not unrolled, when n and i are unsigned
(as shown above), but a vectorized and unrolled loop when n and i are signed
(follow https://godbolt.org/g/nq9xF8 to play with it).

With the changes made in this commit the unsigned version will be
vectorized (though not unrolled for unclear reasons).

Diff Detail

Repository: rL LLVM

Event Timeline

rtereshin created this revision. Jul 2 2018, 2:53 PM

Herald added subscribers: javed.absar, tpr. Jul 2 2018, 2:53 PM

rtereshin added inline comments. Jul 2 2018, 3:03 PM

lib/Analysis/ScalarEvolution.cpp
1813	Another thing to discuss here is the fact that SCEV appears to rely on value range analysis implemented via `ConstantRange` instead of `KnownBits`. It appears to me that we could achieve better results if we used both simultaneously, keeping both properly updated. See the example above justifying that. Do you think it's worth bringing up at the dev mailing list level?

rtereshin added inline comments. Jul 2 2018, 3:49 PM

lib/Analysis/ScalarEvolution.cpp
1813	Now's the adventurous bit: KnownBits is able to prove that C1 + (C2 * 2^n * X) doesn't wrap if C1 < 2^n precisely because KnownBits operates over the arithmetic base of 2. If KnownBits operated over base of 3, for example, we could use it to prove that C1 + (C2 * 3^n * X) doesn't wrap (for instance, u * 3 + 1. Indeed, if bits of 3 are all unknown, KnownBits<base 3> of (u * 3) is XXXX XXX0 and therefore u * 3 + C doesn't wrap for any C <- {0, 1, 2}). I suspect that it could be proven that there is a basis (in linear algebra sense) in the system of KnownBits' that is sufficient: KnownBits over prime numbers. So let's say for every SCEV expression we cache not just ConstantRange, but every non-trivial KnownBits<B>, where B (the base) <- {p1, p2, ..., pK}, p<i> is i-th prime number, K is some reasonable limit, and "non-trivial" means "not all bits (or rather digits) are unknown", and we use that information to effectively restore <nuw>/<nsw> flags where needed.

Could you use ScalarEvolution::GetMinTrailingZeros instead of calling computeKnownBits directly?

In D48853#1150253, @efriedma wrote:

Could you use ScalarEvolution::GetMinTrailingZeros instead of calling computeKnownBits directly?

I don't think so. I need to know the maximum unsigned number I can safely subtract from an expression (w/o wrapping). Number of trailing zeros is of no use for this, I think.
GetMinTrailingOnes, if it existed, wouldn't help much either: let's say, the expression is x * 8 + 6. The number of trailing ones here is 0, though we can safely subtract any number from 0 to 6 (inclusive) (KnownBits are XXXX X110).

To further entertain the idea, let's notice that the upper limit is not always the constant part of the expression. If it's x * 4 + 6 we can only subtract 0, 1, or 2.
Of course, we could use GetMinTrailingZeros(x * 4) = 2, turn this into a 0000 0011 mask, apply the mask to the constant part (0000 0110), and get 0b10 (2) as the maximum safe subtrahend.

But what if there are more than 2 terms? For instance, the expression is 6 + 20 * %x + 24 * %y. Do I need to rebuild an add from 20 * %x and 24 * %y just to apply GetMinTrailingZeros to the result?

Or I could extract the part of GetMinTrailingZeros that handles add as a separate method so it could be used here. Do you think any of these is a better solution than applying computeKnownBits directly?
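The mask computation described in this reply can be sketched in a few lines of C++. The helper names are hypothetical, and the width is fixed to i8 for brevity; `subtractionNeverWraps` brute-forces the claim over every X that is a multiple of 2^TZ.

```cpp
#include <cstdint>

// Hypothetical helper: given the constant part C of an i8 expression C + X,
// where X is known to have at least TZ trailing zeros, returns the largest D
// that can be subtracted from C + X without wrapping: the low TZ bits of C.
uint8_t maxSafeSubtrahend(uint8_t C, unsigned TZ) {
  uint8_t Mask = static_cast<uint8_t>((1u << TZ) - 1);  // TZ = 2 -> 0000 0011
  return static_cast<uint8_t>(C & Mask);
}

// Brute-force check: subtracting D from (C + X) mod 2^8 never goes below
// zero for any 8-bit X that is a multiple of 2^TZ.
bool subtractionNeverWraps(uint8_t C, unsigned TZ, unsigned D) {
  for (unsigned X = 0; X < 256; X += (1u << TZ)) {
    unsigned Sum = (C + X) & 0xFFu;
    if (Sum < D)
      return false;
  }
  return true;
}
```

This reproduces the two examples above: for x * 8 + 6 the safe subtrahend is 6, and for x * 4 + 6 it is 2.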

rtereshin added inline comments. Jul 3 2018, 9:41 AM

lib/Analysis/ScalarEvolution.cpp
1813	Ah, I just realized that due to the unfortunate fact that (2^4 - 1) is divisible by 3 and 5, KnownBits over base 3 won't allow us to prove that 3u + 1 doesn't wrap, as it very well may. It will allow us to prove though that 3u + 1, 3u + 2, and 3u + 3 are consecutive, but it's probably not as useful as if we could start from 0. Same for base 5.
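A quick brute-force look at width 4 illustrates why base 3 is awkward here. This is an illustration assuming plain unsigned 4-bit arithmetic, not an implementation of base-3 known bits: since 3 and 16 are coprime, 3u modulo 16 takes every 4-bit value, so a wrapped 3u carries no guaranteed base-3 digit, and 3u + 1 wraps when 3u mod 16 is 15 (u = 5).

```cpp
#include <set>

// All values that 3 * u takes in unsigned 4-bit arithmetic (mod 16).
std::set<unsigned> valuesOf3uIn4Bits() {
  std::set<unsigned> Seen;
  for (unsigned U = 0; U < 16; ++U)
    Seen.insert((3 * U) % 16);
  return Seen;
}

// True if adding 1 to the 4-bit value 3 * u can carry out of bit 3,
// i.e. if 3 * u mod 16 can be 1111.
bool threeUPlusOneCanWrap() {
  for (unsigned U = 0; U < 16; ++U)
    if ((3 * U) % 16 == 15)
      return true;
  return false;
}
```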

ping

mkazantsev added inline comments. Jul 9 2018, 10:23 PM

lib/Analysis/ScalarEvolution.cpp
1816	I don't understand why we need this. `computeKnownBits` is used to deduce ranges of SCEVUnknown. All other SCEV nodes are supposed to propagate range information (e.g. range of sum is a range from sum of min to sum of max, and so on). Thus, in theory, we should be able to identify the range of any SCEV correctly, unless we have some missing logic in range calculation. What is `OpV` in the example you're trying to improve, and why SCEV was unable to deduce its range via `getUnsignedRange(getSCEV(OpV))`?

mkazantsev added inline comments. Jul 9 2018, 10:25 PM

test/Analysis/ScalarEvolution/no-wrap-add-exprs.ll
150	It's weird. Why are the signed and unsigned ranges different?

rtereshin added inline comments. Jul 9 2018, 11:12 PM

lib/Analysis/ScalarEvolution.cpp
1816	> What is OpV in the example you're trying to improve

One of the examples is given in the comment nearby:

  // (zext (add (shl X, C1), C2)), for instance, (zext (5 + (4 * X))).
  // ConstantRange is unable to prove that 1 + (4 + 4 * X) doesn't wrap in
  // such cases:
  //
  //    | Expression |     ConstantRange      |       KnownBits       |
  //    |------------|------------------------|-----------------------|
  //    | 4 * X      | [L: 0, U: 253)         | XXXX XX00             |
  //    |            |   => Min: 0, Max: 252  |   => Min: 0, Max: 252 |
  //    |            |                        |                       |
  //    | 4 * X + 4  | [L: 4, U: 1) (wrapped) | YYYY YY00             |
  //    |            |   => Min: 0, Max: 255  |   => Min: 0, Max: 252 |

See also an lldb session running a similar example, also present in `test/Analysis/ScalarEvolution/no-wrap-add-exprs.ll` updated in this patch:

     1814 	  if (OpV) {
     1815 	    const DataLayout &DL = getDataLayout();
     1816 	    KnownBits Known = computeKnownBits(OpV, DL, 0, &AC, nullptr, &DT);
  -> 1817 	    MinValue = Known.One.ugt(MinValue) ? Known.One : MinValue;
     1818 	  }
     1819 	  APInt C = SC->getAPInt();
     1820 	  APInt D = MinValue.ugt(C) ? C : MinValue;
  Target 0: (opt) stopped.
  (lldb) p OpV->dump()
    %t1 = add i8 %t0, 5
  (lldb) p SA->dump()
  (5 + (4 * %x))
  (lldb) p MinValue
  (llvm::APInt) $1 = {
    U = {
      VAL = 0
      pVal = 0x0000000000000000
    }
    BitWidth = 8
  }
  (lldb) p Known.One.dump()
  APInt(8b, 1u 1s)
  (lldb) p getUnsignedRange(SA)
  (llvm::ConstantRange) $2 = {
    Lower = {
      U = {
        VAL = 5
        pVal = 0x0000000000000005
      }
      BitWidth = 8
    }
    Upper = {
      U = {
        VAL = 2
        pVal = 0x0000000000000002
      }
      BitWidth = 8
    }
  }

> and why SCEV was unable to deduce its range via getUnsignedRange(getSCEV(OpV))?

As you mentioned, the range information is calculated using the "range of sum is a range from sum of min to sum of max, and so on" principle. `ConstantRange` keeps track of min/max boundaries, but it completely loses any periodic information, like "the range contains only values divisible by 4". `KnownBits` behavior is the exact opposite: it's imprecise when it comes to boundaries, but it keeps track of the periodic information.
For instance, the only thing that is known about `4 * x` is that the 2 least significant bits of the value are 0s. From the `ConstantRange` perspective, it only means that the value doesn't exceed `2^32 - 4` if treated as an unsigned `i32`. It's completely unaware of the fact that `4 * x` cannot be 7, for instance. If we shift the range by adding 5 (`4 * x + 5`), min/max recomputation of the range leads to the wrapped-around range `[5, 2)`, which gives us no useful information about the minimum and maximum values (minimum is `0`, maximum is `2^32 - 1`). From known bits, on the other hand, we know that `4 * x + 5` looks like `XXX...XX01`; therefore, the minimum value is `1`, and the maximum value is `2^32 - 3`. Ranges and KnownBits are complementary to each other: neither is more precise than the other in all cases. If we want a value range analysis with good precision, we need to maintain and update both simultaneously.
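The contrast between the two abstractions can be reproduced on an i8 version of the 4 * x + 5 example. The sketch below is illustrative and does not use LLVM's `ConstantRange` or `KnownBits` classes: it hand-computes an interval-style bound, a known-bits-style bound, and the exact bounds by enumeration.

```cpp
#include <cstdint>

struct Bounds {
  unsigned Min;
  unsigned Max;
};

// Interval-style reasoning, as a min/max range would do it: 4 * x lies in
// [0, 252] as an unsigned i8, so 4 * x + 5 would lie in [5, 257]. That does
// not fit in 8 bits, so the interval wraps and nothing better than the full
// range [0, 255] can be claimed.
Bounds intervalBounds() {
  unsigned Lo = 0 + 5, Hi = 252 + 5;
  if (Hi > 255)
    return {0, 255};
  return {Lo, Hi};
}

// Known-bits-style reasoning: 4 * x + 5 always looks like XXXX XX01, so
// setting every unknown bit to 0 gives the minimum and setting every
// unknown bit to 1 gives the maximum.
Bounds knownBitsBounds() {
  const uint8_t KnownOne = 0x01;   // bit 0 is known to be 1
  const uint8_t KnownZero = 0x02;  // bit 1 is known to be 0
  unsigned Min = KnownOne;
  unsigned Max = static_cast<uint8_t>(~KnownZero);  // 1111 1101
  return {Min, Max};
}

// Exact bounds by enumerating every 8-bit x.
Bounds exactBounds() {
  unsigned Min = 255, Max = 0;
  for (unsigned X = 0; X < 256; ++X) {
    unsigned V = (4 * X + 5) & 0xFFu;
    if (V < Min) Min = V;
    if (V > Max) Max = V;
  }
  return {Min, Max};
}
```

In i8, the interval approach gives [0, 255], while the known-bits approach gives [1, 253], which matches the exact bounds.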

rtereshin added inline comments. Jul 9 2018, 11:42 PM

test/Analysis/ScalarEvolution/no-wrap-add-exprs.ll
150	That's a good question. One thing I know is that the issue is orthogonal to this patch and exists on trunk (this is without this patch applied):

  %p1.zext = zext i8 %p1 to i16
  -->  (zext i8 (8 + (4 * %x)) to i16) U: [0,253) S: [0,256)

Perhaps the unsigned range takes some knownbits-like information into account, while the signed one doesn't.

rtereshin added inline comments. Jul 9 2018, 11:53 PM

test/Analysis/ScalarEvolution/no-wrap-add-exprs.ll
150	Maybe this is the spot: https://github.com/llvm-mirror/llvm/blob/650cfa6dc060acb5b4c9571d454ec2b990aad648/lib/Analysis/ScalarEvolution.cpp#L5594-L5613

rtereshin added inline comments. Jul 10 2018, 12:11 AM

lib/Analysis/ScalarEvolution.cpp
1816	@mkazantsev I think I see the source of the confusion: apparently, the current implementation tries to utilize knownbits-like information in a limited form of "number of trailing zeros", which is computed for `Add` the following way:

  if (const SCEVAddExpr *A = dyn_cast<SCEVAddExpr>(S)) {
    // The result is the min of all operands results.
    uint32_t MinOpRes = GetMinTrailingZeros(A->getOperand(0));
    for (unsigned i = 1, e = A->getNumOperands(); MinOpRes && i != e; ++i)
      MinOpRes = std::min(MinOpRes, GetMinTrailingZeros(A->getOperand(i)));
    return MinOpRes;
  }

https://github.com/llvm-mirror/llvm/blob/650cfa6dc060acb5b4c9571d454ec2b990aad648/lib/Analysis/ScalarEvolution.cpp#L5375-L5381

So it does work for expressions like `4 + 4 * x` (well, sometimes: somehow that kind of information is there for unsigned ranges, but not for signed ranges), and that makes my comment inaccurate. I will change it to a `5 + 4 * x` example. For `5 + 4 * x` it doesn't work, of course, as the number of trailing zeroes is 0 (`5 + 4 * x` ~= `XXX...XX01`).
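The min-of-operands rule quoted above can be modeled in a few lines. This is a simplified sketch, not SCEV's `GetMinTrailingZeros`: it assumes every operand of the add is either an i8 constant or a constant times an unknown value with no known trailing zeros of its own, so each operand contributes the trailing zeros of its constant.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Trailing zero bits of an i8 constant; 8 for zero.
unsigned ctz8(uint8_t V) {
  if (V == 0)
    return 8;
  unsigned N = 0;
  while ((V & 1u) == 0) {
    V = static_cast<uint8_t>(V >> 1);
    ++N;
  }
  return N;
}

// Minimum guaranteed trailing zeros of an i8 add, given for each operand
// either the constant itself or the constant multiplier of an unknown value.
// The add is guaranteed only as many trailing zeros as its weakest operand.
unsigned minTrailingZerosOfAdd(const std::vector<uint8_t> &Consts) {
  unsigned Res = 8;
  for (uint8_t C : Consts)
    Res = std::min(Res, ctz8(C));
  return Res;
}
```

Under this model, 4 + 4 * x keeps 2 trailing zeros while 5 + 4 * x has none; excluding the constant term, as the later revision of the patch does by starting from operand 1, 12x + 8y still guarantees 2 trailing zeros even though 21 + 12x + 8y guarantees none.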

Updated the comment from a misleading 4 + 4 * x example to a correct 5 + 4 * x one.

ping

I think if we're going to do this, we need to implement it on top of a SCEV-based known-bits implementation; introducing a separate getZeroExtendExprForValue API is going to lead to weird results if SCEV creates a zero-extend expression for some other reason.

Whether we should do this in general, I'm not really sure. I mean, yes, I can see how this particular form is a bit more convenient for the load-store vectorizer, but it doesn't seem very general; it seems more intuitive to canonicalize towards reducing the number of AddExprs. But maybe pulling as much information as possible outside of the zext is generally useful enough to make this worthwhile?

Do you also plan to implement a similar transform for AddRecs? (e.g. (zext i32 {1,+,2}<%while.body> to i64)).

In D48853#1165748, @efriedma wrote:

I think if we're going to do this, we need to implement it on top of a SCEV-based known-bits implementation; introducing a separate getZeroExtendExprForValue API is going to lead to weird results if SCEV creates a zero-extend expression for some other reason.

I agree, that would be a more generic and homogeneous solution. Using a (ConstantRange, KnownBits) pair instead of (ConstantRange, minTrailingZeros) (let alone only one component of the latter pair) consistently across Scalar Evolution may also benefit the framework in a number of other ways. It's a more intrusive change, though. For now, I could try to go with the number-of-trailing-zeros approach despite the loss in generality, if you or the community feel strongly against known bits being used the way they are used now in this patch.

Whether we should do this in general, I'm not really sure. I mean, yes, I can see how this particular form is a bit more convenient for the load-store vectorizer, but it doesn't seem very general; it seems more intuitive to canonicalize towards reducing the number of AddExprs. But maybe pulling as much information as possible outside of the zext is generally useful enough to make this worthwhile?

I think this transformation reduces the number of possible operands of a zext, so it brings some of the expressions in C1 + zext(C2 + X) form to the same shape, often C3 + zext(C4 * 2^k + X), which is canonicalization (if some of the constants are missing, let's say they are just zeroes). There are 2^(2w) different pairs of constants (C1, C2), and only 2^(2w - k) different pairs of (C3, C4 * 2^k), where w is the bit width of the type.

Do you also plan to implement a similar transform for AddRecs? (e.g. (zext i32 {1,+,2}<%while.body> to i64)).

I suppose I should. What would you suggest?

I've moved away from using KnownBits, extended the proposed transformation to AddRecs, generalized it for signed extensions as well, and unified it all with pre-existing sext-only transformations that handle a strict subset of cases.

There are 2 separate commits planned here: the first non-intrusively adds only the zext(C + x + ...) -> (D + zext(C-D + x + ...))<nuw><nsw> transformation (in its no-KnownBits / no-API-changes version that can be seen in this patch), along with the tests from the initial version of this patch (mostly LoadStoreVectorizer-related), while the second commit brings the rest (and also adds SLPVectorizer-targeting tests).

Hopefully this is better now.

Herald added a subscriber: dmgreen. Jul 18 2018, 11:40 AM

rtereshin added a reviewer: efriedma. Jul 19 2018, 1:02 PM

rtereshin added a subscriber: bogner.

Hi,

Thanks for working on this! From the description the approach looks correct and promising. I glanced through the patch, and it looked good, but if you don't mind I'd like to have another look.

Thanks,
Michael

lib/Analysis/ScalarEvolution.cpp
1569–1571	Could you please also add a description for `FullExpr`? It might be helpful to add even more examples here and describe the intended use (e.g. `ConstantTerm` is `Start` and `FullExpr` is `Step` of an `AddRec` expression).
1588–1591	Just checking my understanding: we're basically finding the largest common denominator here, which is also a power of 2, right?

rtereshin added inline comments. Jul 19 2018, 3:48 PM

lib/Analysis/ScalarEvolution.cpp
1569–1571	Thanks for looking into this!

> Could you please also add a description for FullExpr?

Sure, will do.

> It might be helpful to add even more examples here and describe the intended use (e.g. ConstantTerm is Start and FullExpr is Step of an AddRec expression).

The next overload is the one that handles AddRec, with parameter names being `ConstantStart` and `Step`. Do you think the names are self-explanatory, or do I need to elaborate in a comment? As for the AddExpr version here, I'm going to elaborate what `FullExpr` is, and hopefully that will be clear enough.
1588–1591	> we're basically finding the largest common denominator here, which is also a power of 2, right?

This piece

  uint32_t TZ = BitWidth;
  for (unsigned I = 1, E = FullExpr->getNumOperands(); I < E && TZ; ++I)
    TZ = std::min(TZ, SE.GetMinTrailingZeros(FullExpr->getOperand(I)));

effectively does exactly that, yes.

Another way to look at it is to say that we have an `AddExpr` that looks like `(C + x + y + ...)`, where `C` is a constant and x, y, ... are arbitrary SCEVs, and we're computing the minimum number of trailing zeroes guaranteed of the sum w/o the constant term: `(x + y + ...)`. If, for example, those terms look as follows:

        i
XXXX...X000
YYYY...YY00
   ...
ZZZZ...0000

then the rightmost non-guaranteed zero bit (a potential one at i-th position above) can change the bits of the sum to the left, but it cannot possibly change the bits to the right. So we can compute the number of trailing zeroes by taking a minimum between the numbers of trailing zeroes of the terms.

Now let's say that our original sum with the constant is effectively just `C + X`, where `X = x + y + ...`. Let's say we've got 2 guaranteed trailing zeros for `X`:

        j
CCCC...CCCC
XXXX...XX00

Any bit of `C` to the left of `j` may in the end cause the `C + X` sum to wrap, but the rightmost 2 bits of `C` (at positions `j` and `j - 1`) do not affect wrapping in any way. If the upper bits cause a wrap, it will be a wrap regardless of the values of the 2 least significant bits of `C`. If the upper bits do not cause a wrap, it won't be a wrap regardless of the values of the 2 bits on the right (again). So let's split C into 2 constants as follows:

0000...00CC = D
CCCC...CC00 = (C - D)

and the whole sum into `D + (C - D + X)`. The second term of this new sum looks like this:

CCCC...CC00
XXXX...XX00
-----------
YYYY...YY00

The sum above (let's call it `Y`) may or may not wrap, we don't know, so we need to keep it under a sext/zext. Adding `D` to that sum, though, will never wrap, signed or unsigned, if performed on the original bit width or the extended one, because all that that final add does is set the 2 least significant bits of `Y` to the bits of `D`:

YYYY...YY00 = Y
0000...00CC = D
-----------  <nuw><nsw>
YYYY...YYCC

This means we can safely move that D out of the sext or zext and claim that the top-level sum neither sign wraps nor unsigned wraps.

Let's run an example: say we're working in `i8`s and the original expression (the zext's or sext's operand) is `21 + 12x + 8y`. It goes like this:

0001 0101  // 21
XXXX XX00  // 12x
YYYY Y000  // 8y

0001 0101  // 21
ZZZZ ZZ00  // 12x + 8y
           // true; alternatively, one can say that gcd(12, 8) is guaranteed to have 2 zeroes on the right

0000 0001  // D
0001 0100  // 21 - D = 20
ZZZZ ZZ00  // 12x + 8y

0000 0001  // D
WWWW WW00  // 21 - D + 12x + 8y = 20 + 12x + 8y

therefore `zext(21 + 12x + 8y)` = `(1 + zext(20 + 12x + 8y))<nuw><nsw>`

mzolotukhin added inline comments. Jul 19 2018, 4:05 PM

lib/Analysis/ScalarEvolution.cpp
1569–1571	Thanks! I'm fine with whatever way you choose, I'd just like to see some example/description of what `FullExpr` should be. For instance, I spent some time thinking that `FullExpr` would be, e.g. `3 + 4x + 6y` (which was inspired by your examples) and I couldn't understand how you can ever get TZ not equal to 0. It all made sense in the end, but saying that `ConstantStart` is 3 and `FullExpr` is `4x + 6y` would've saved me some time :)
1588–1591	Thanks for the great explanation! I think it's worth having it or its shorter version somewhere in comments. And just to be clear: I think that the patch is already very well-commented (thank you for that!), my remarks are just nit-picks.

rtereshin added inline comments. Jul 19 2018, 4:24 PM

lib/Analysis/ScalarEvolution.cpp
1569–1571	Thing is, you were right: `SCEVAddExpr *FullExpr` is `3 + 4x + 6y`. It's the iteration (`for (unsigned I = 1,...`) that goes from operand 1 instead of operand 0. I feel like it would be a little trickier to ask clients of this function to provide a reference (a pair of operand iterators, for instance) to `4x + 6y`.
1588–1591	You are welcome! Hm... Do you think it could be better if I just put this as is in the commit message instead? If someone gets curious, they can git blame and see a detailed explanation attributed to the exact version of the code that that explanation describes. This way we don't have a really huge comment in the code that will most certainly get out of sync with the implementation at some point in the future.

mzolotukhin added inline comments. Jul 19 2018, 4:35 PM

lib/Analysis/ScalarEvolution.cpp
1569–1571	Right, we start from 1! Anyways, some note explaining what's going on with `FullExpr` should help here.
1588–1591	> Do you think it could be better if I just put this as is in the commit message instead?

Yeah, that's a good idea.

Updated and added comments as requested, renamed FullExpr to WholeAddExpr, and planned to add the following piece to the end of the commit message:

How it all works:

Let's say we have an AddExpr that looks like (C + x + y + ...), where C
is a constant and x, y, ... are arbitrary SCEVs. Let's compute the
minimum number of trailing zeroes guaranteed of that sum w/o the
constant term: (x + y + ...). If, for example, those terms look like
follows:

        i
XXXX...X000
YYYY...YY00
   ...
ZZZZ...0000

then the rightmost non-guaranteed-zero bit (a potential one at i-th
position above) can change the bits of the sum to the left (and at
i-th position itself), but it can not possibly change the bits to the
right. So we can compute the number of trailing zeroes by taking a
minimum between the numbers of trailing zeroes of the terms.

Now let's say that our original sum with the constant is effectively
just C + X, where X = x + y + .... Let's also say that we've got 2
guaranteed trailing zeros for X:

        j
CCCC...CCCC
XXXX...XX00  // this is X = (x + y + ...)

Any bit of C to the left of j may in the end cause the C + X sum to
wrap, but the rightmost 2 bits of C (at positions j and j - 1) do not
affect wrapping in any way. If the upper bits cause a wrap, it will be
a wrap regardless of the values of the 2 least significant bits of C.
If the upper bits do not cause a wrap, it won't be a wrap regardless
of the values of the 2 bits on the right (again).

So let's split C into 2 constants as follows:

0000...00CC  = D
CCCC...CC00  = (C - D)

and represent the whole sum as D + (C - D + X). The second term of
this new sum looks like this:

CCCC...CC00
XXXX...XX00
-----------  // let's add them up
YYYY...YY00

The sum above (let's call it Y) may or may not wrap, we don't know,
so we need to keep it under a sext/zext. Adding D to that sum though
will never wrap, signed or unsigned, if performed on the original bit
width or the extended one, because all that that final add does is
setting the 2 least significant bits of Y to the bits of D:

YYYY...YY00 = Y
0000...00CC = D
-----------  <nuw><nsw>
YYYY...YYCC

Which means we can safely move that D out of the sext or zext and
claim that the top-level sum neither sign wraps nor unsigned wraps.

Let's run an example: say we're working in i8s and the original
expression (the zext's or sext's operand) is 21 + 12x + 8y. It goes
like this:

0001 0101  // 21
XXXX XX00  // 12x
YYYY Y000  // 8y

0001 0101  // 21
ZZZZ ZZ00  // 12x + 8y

0000 0001  // D
0001 0100  // 21 - D = 20
ZZZZ ZZ00  // 12x + 8y

0000 0001  // D
WWWW WW00  // 21 - D + 12x + 8y = 20 + 12x + 8y

therefore zext(21 + 12x + 8y) = (1 + zext(20 + 12x + 8y))<nuw><nsw>
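The i8 example can be verified exhaustively. The following C++ sketch (hypothetical helper names, not code from the patch) checks both the zext and the sext form for every 8-bit x and y, for a given D; it assumes two's complement int and performs the outer addition in a wider type.

```cpp
// Reads the low 8 bits of V as a two's complement i8 value.
static int toSigned8(int V) {
  int U = V & 0xFF;
  return U < 128 ? U : U - 256;
}

// Returns true if, for every 8-bit x and y,
//   zext(21 + 12x + 8y) == D + zext((21 - D) + 12x + 8y)  and
//   sext(21 + 12x + 8y) == D + sext((21 - D) + 12x + 8y),
// where the operands wrap modulo 2^8 and the outer add is done in int.
// Assumes 0 <= D <= 21.
bool splitHolds(int D) {
  for (int X = 0; X < 256; ++X)
    for (int Y = 0; Y < 256; ++Y) {
      int Orig = (21 + 12 * X + 8 * Y) & 0xFF;      // i8 bits of the original operand
      int Rest = (21 - D + 12 * X + 8 * Y) & 0xFF;  // i8 bits of the new operand
      if (Orig != D + Rest)                         // zext: bits read as unsigned
        return false;
      if (toSigned8(Orig) != D + toSigned8(Rest))   // sext: bits read as signed
        return false;
    }
  return true;
}
```

With TZ = 2 (both 12 = 0b1100 and 8 = 0b1000 end in at least two zeros), D = 21 & 0b11 = 1 and the check passes; moving D = 5 out of the extension instead fails, because bit 2 of the remaining operand is not guaranteed to be zero.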

I've looked at the patch one more time more carefully, it looks good to me!

Thanks,
Michael

lib/Analysis/ScalarEvolution.cpp
1573–1582	Maybe move that comment to the explanation in the commit message too? It's not obvious what `ConstantRange` has to do with the surrounding code (I understand where it comes from, but I have the full context of the patch now; for someone browsing through the code later it won't be clear why we mention `ConstantRange` here).

This revision is now accepted and ready to land. Jul 19 2018, 5:54 PM

rtereshin added inline comments. Jul 19 2018, 6:29 PM

lib/Analysis/ScalarEvolution.cpp
1573–1582	Yeah, it has felt out of place for a while now; that's a good suggestion, will do. Thanks for accepting the patch!

Removed an out-of-place comment (the example comparing ConstantRange and KnownBits).

In D48853#1165748, @efriedma wrote:

I think if we're going to do this, we need to implement it on top of a SCEV-based known-bits implementation; introducing a separate getZeroExtendExprForValue API is going to lead to weird results if SCEV creates a zero-extend expression for some other reason.

Whether we should do this in general, I'm not really sure. I mean, yes, I can see how this particular form is a bit more convenient for the load-store vectorizer, but it doesn't seem very general; it seems more intuitive to canonicalize towards reducing the number of AddExprs. But maybe pulling as much information as possible outside of the zext is generally useful enough to make this worthwhile?

Do you also plan to implement a similar transform for AddRecs? (e.g. (zext i32 {1,+,2}<%while.body> to i64)).

Hi Eli,

I have moved away from using KnownBits, so there is no new API anymore. I have also generalized this for AddRecs and sexts, which effectively generalized the pre-existing transforms for sexts and added them anew for zexts.

Do you think this is good to go now?

Thanks,
Roman

Yes, looks fine.

In D48853#1173829, @efriedma wrote:

Yes, looks fine.

Thanks!

Closed by commit rL337859: [SCEV] Add zext(C + x + ...) -> D + zext(C-D + x + ...)<nuw><nsw> transform (authored by rtereshin). Jul 24 2018, 2:49 PM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path | Size
include/llvm/Analysis/ScalarEvolution.h | 4 lines
lib/Analysis/ScalarEvolution.cpp | 56 lines
test/Analysis/ScalarEvolution/no-wrap-add-exprs.ll | 81 lines
test/Transforms/LoadStoreVectorizer/X86/codegenprepare-produced-address-math.ll | 78 lines

Diff 153796

include/llvm/Analysis/ScalarEvolution.h

[... 517 lines not shown, in the public: section ...]
   /// expression.
   const SCEV *getSCEV(Value *V);

   const SCEV *getConstant(ConstantInt *V);
   const SCEV *getConstant(const APInt &Val);
   const SCEV *getConstant(Type *Ty, uint64_t V, bool isSigned = false);
   const SCEV *getTruncateExpr(const SCEV *Op, Type *Ty);
   const SCEV *getZeroExtendExpr(const SCEV *Op, Type *Ty, unsigned Depth = 0);
+  /// Sink zext towards the leaves of SCEV more aggressively using KnownBits,
+  /// where the latter requires the original IR Value to be available.
+  const SCEV *getZeroExtendExprForValue(const Value *OpV, const SCEV *Op,
+                                        Type *Ty, unsigned Depth = 0);
   const SCEV *getSignExtendExpr(const SCEV *Op, Type *Ty, unsigned Depth = 0);
   const SCEV *getAnyExtendExpr(const SCEV *Op, Type *Ty);
   const SCEV *getAddExpr(SmallVectorImpl<const SCEV *> &Ops,
                          SCEV::NoWrapFlags Flags = SCEV::FlagAnyWrap,
                          unsigned Depth = 0);
   const SCEV *getAddExpr(const SCEV *LHS, const SCEV *RHS,
                          SCEV::NoWrapFlags Flags = SCEV::FlagAnyWrap,
                          unsigned Depth = 0) {
[... 1,481 lines not shown ...]

lib/Analysis/ScalarEvolution.cpp


[... 1,556 lines not shown ...]
     if (PreAR && PreAR->getNoWrapFlags(WrapType)) { // proves (2)
       if (Limit && isKnownPredicate(Pred, PreAR, Limit)) // proves (1)
         return true;
     }
   }

   return false;
 }

-const SCEV *
-ScalarEvolution::getZeroExtendExpr(const SCEV *Op, Type *Ty, unsigned Depth) {
+const SCEV *ScalarEvolution::getZeroExtendExpr(const SCEV *Op, Type *Ty,
+                                               unsigned Depth) {
+  return getZeroExtendExprForValue(nullptr, Op, Ty, Depth);
+}
+
+const SCEV *ScalarEvolution::getZeroExtendExprForValue(const Value *OpV,
+                                                       const SCEV *Op, Type *Ty,
		mzolotukhinUnsubmitted Not Done Reply Inline Actions Could you please also add a description for `FullExpr`? It might be helpful to add even more examples here and describe the intended use (e.g. `ConstantTerm` is `Start` and `FullExpr` is `Step` of an `AddRec` expression). mzolotukhin: Could you please also add a description for `FullExpr`? It might be helpful to add even more…
		rtereshinAuthorUnsubmitted Not Done Reply Inline Actions Thanks for looking into this! Could you please also add a description for FullExpr? Sure, will do. It might be helpful to add even more examples here and describe the intended use (e.g. ConstantTerm is Start and FullExpr is Step of an AddRec expression). The next overload is the one that handles AddRec with parameter names being `ConstantStart` and `Step`. Do you think the names are self-explanatory or I need to elaborate in a comment? As for AddExpr-version here I'm going to elaborate what `FullExpr` is and hopefully that will be clear enough. rtereshin: Thanks for looking into this! > Could you please also add a description for FullExpr? Sure…
		mzolotukhinUnsubmitted Not Done Reply Inline Actions Thanks! I'm fine with whatever way you choose, I'd just like to see some example/description of what `FullExpr` should be. For instance, I spent some time thinking that `FullExpr` would be, e.g. `3 + 4x + 6y` (which was inspired by your examples) and I couldn't understand how you can ever get TZ not equal to 0. It all made sense in the end, but saying that `ConstantStart` is 3 and `FullExpr` is `4x + 6y` would've saved me some time :) mzolotukhin: Thanks! I'm fine with whatever way you choose, I'd just like to see some example/description of…
		rtereshinAuthorUnsubmitted Not Done Reply Inline Actions Thing is, you were right, `SCEVAddExpr FullExpr` is `3 + 4x + 6y`. It's the iteration (`for (unsigned I = 1,...`) that goes from operand 1 instead of operand 0. I feel like it would be a little trickier to ask clients of this function to provide a reference (a pair of operand iterators, for instance) to `4x + 6y`. rtereshin:* Thing is, you were right, `SCEVAddExpr FullExpr` is `3 + 4x + 6*y`. It's the iteration (`for…
		mzolotukhinUnsubmitted Not Done Reply Inline Actions Right, we start from 1! Anyways, some note explaining what's going on with `FullExpr` should help here. mzolotukhin: Right, we start from 1! Anyways, some note explaining what's going on with `FullExpr` should…
		unsigned Depth) {
 assert(getTypeSizeInBits(Op->getType()) < getTypeSizeInBits(Ty) &&
 "This is not an extending conversion!");
 assert(isSCEVable(Ty) &&
 "This is not a conversion to a SCEVable type!");
 Ty = getEffectiveSCEVType(Ty);

 // Fold if the operand is constant.
 if (const SCEVConstant *SC = dyn_cast<SCEVConstant>(Op))
 return getConstant(
 cast<ConstantInt>(ConstantExpr::getZExt(SC->getValue(), Ty)));
mzolotukhin: Maybe move that comment to the explanation in the commit message too? It's not obvious what `ConstantRange` has to do with the surrounding code (I understand where it comes from, but I have the full context of the patch now; for someone browsing through the code later it won't be clear why we mention `ConstantRange` here).
rtereshin (author): Yeah, it has felt out of place for a while now; that's a good suggestion, will do. Thanks for accepting the patch!

 // zext(zext(x)) --> zext(x)
 if (const SCEVZeroExtendExpr *SZ = dyn_cast<SCEVZeroExtendExpr>(Op))
 return getZeroExtendExpr(SZ->getOperand(), Ty, Depth + 1);

 // Before doing any expensive analysis, check to see if we've already
 // computed a SCEV for this Op and Ty.
 FoldingSetNodeID ID;
 ID.AddInteger(scZeroExtend);
mzolotukhin: Just checking my understanding: we're basically finding the largest common denominator here, which is also a power of 2, right?
rtereshin (author):
> we're basically finding the largest common denominator here, which is also a power of 2, right?

This

    uint32_t TZ = BitWidth;
    for (unsigned I = 1, E = FullExpr->getNumOperands(); I < E && TZ; ++I)
      TZ = std::min(TZ, SE.GetMinTrailingZeros(FullExpr->getOperand(I)));

piece effectively does exactly that, yes.

Another way to look at it is to say that we have an `AddExpr` that looks like `(C + x + y + ...)`, where `C` is a constant and `x`, `y`, ... are arbitrary SCEVs, and we're computing the minimum number of trailing zeroes guaranteed for the sum without the constant term: `(x + y + ...)`. If, for example, those terms look as follows:

                i
    XXXX...X000
    YYYY...YY00
    ...
    ZZZZ...0000

then the rightmost bit that is not guaranteed to be zero (a potential one at the i-th position above) can change the bits of the sum to its left, but it cannot possibly change the bits to its right. So we can compute the number of trailing zeroes by taking the minimum of the numbers of trailing zeroes of the terms.

Now let's say that our original sum with the constant is effectively just `C + X`, where `X = x + y + ...`, and that we've got 2 guaranteed trailing zeros for `X`:

                 j
    CCCC...CCCC
    XXXX...XX00

Any bit of `C` to the left of `j` may in the end cause the `C + X` sum to wrap, but the rightmost 2 bits of `C` (at positions `j` and `j - 1`) do not affect wrapping in any way. If the upper bits cause a wrap, it will be a wrap regardless of the values of the 2 least significant bits of `C`. If the upper bits do not cause a wrap, it won't be a wrap regardless of the values of those 2 bits either.

So let's split `C` into 2 constants as follows:

    0000...00CC = D
    CCCC...CC00 = (C - D)

and the whole sum into `D + (C - D + X)`. The second term of this new sum looks like this:

    CCCC...CC00
    XXXX...XX00
    -----------
    YYYY...YY00

The sum above (let's call it `Y`) may or may not wrap, we don't know, so we need to keep it under a sext/zext. Adding `D` to that sum, though, will never wrap, signed or unsigned, whether performed at the original bit width or the extended one, because all that final add does is set the 2 least significant bits of `Y` to the bits of `D`:

    YYYY...YY00 = Y
    0000...00CC = D
    ----------- <nuw><nsw>
    YYYY...YYCC

This means we can safely move that `D` out of the sext or zext and claim that the top-level sum neither sign-wraps nor unsigned-wraps.

Let's run an example. Say we're working in `i8`s and the original expression (the zext's or sext's operand) is `21 + 12x + 8y`. It goes like this:

    0001 0101 // 21
    XXXX XX00 // 12x
    YYYY Y000 // 8y

    0001 0101 // 21
    ZZZZ ZZ00 // 12x + 8y // true, alternatively one can say that gcd(12, 8) is guaranteed to have 2 zeroes on the right

    0000 0001 // D
    0001 0100 // 21 - D = 20
    ZZZZ ZZ00 // 12x + 8y

    0000 0001 // D
    WWWW WW00 // 21 - D + 12x + 8y = 20 + 12x + 8y

Therefore `zext(21 + 12x + 8y)` = `(1 + zext(20 + 12x + 8y))<nuw><nsw>`.
mzolotukhin: Thanks for the great explanation! I think it's worth having it, or a shorter version of it, somewhere in the comments. And just to be clear: I think the patch is already very well commented (thank you for that!); my remarks are just nit-picks.
rtereshin (author): You are welcome! Hm... Do you think it could be better if I just put this as is in the commit message instead? If someone is curious, they can git blame and see a detailed explanation attributed to the exact version of the code that the explanation describes. This way we don't have a really huge comment in the code that will most certainly get out of sync with the implementation at some point in the future.
mzolotukhin:
> Do you think it could be better if I just put this as is in the commit message instead?

Yeah, that's a good idea.
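The splitting argument above can be checked exhaustively at a small bit width. The following is a minimal C++ sketch, not LLVM code: it emulates `i8` arithmetic with `uint8_t`, and the helper names (`trailingZeros8`, `minTrailingZerosOfSum`, `splitConstant`, `checkSplit`) are hypothetical. It takes the guaranteed trailing zeros of `A*x + B*y` as the minimum over the coefficients (mirroring the `std::min` loop quoted above), keeps the low `TZ` bits of `C` as `D`, and verifies for every `x` and `y` that the outer addition only fills in the low bits.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <initializer_list>

// Number of trailing zero bits of an 8-bit value; 8 for zero.
unsigned trailingZeros8(uint8_t V) {
  if (V == 0)
    return 8;
  unsigned TZ = 0;
  for (; (V & 1u) == 0; V >>= 1)
    ++TZ;
  return TZ;
}

// Guaranteed trailing zeros of Coeff0 * x + Coeff1 * y + ... for arbitrary
// x, y, ...: each product keeps the trailing zeros of its coefficient, and a
// sum keeps the minimum of its terms' trailing zeros.
unsigned minTrailingZerosOfSum(std::initializer_list<uint8_t> Coeffs) {
  unsigned TZ = 8;
  for (uint8_t Coeff : Coeffs)
    TZ = std::min(TZ, trailingZeros8(Coeff));
  return TZ;
}

// D = the low TZ bits of C, so that C - D has at least TZ trailing zeros.
uint8_t splitConstant(uint8_t C, unsigned TZ) {
  assert(TZ <= 8 && "TZ cannot exceed the bit width");
  unsigned Mask = TZ >= 8 ? 0xFFu : (1u << TZ) - 1u;
  return static_cast<uint8_t>(C & Mask);
}

// For every 8-bit x and y, checks that
//   zext16(C + A*x + B*y) == D + zext16((C - D) + A*x + B*y)
// and that the outer add equals (Residual | D), i.e. it produces no carry,
// so it can wrap neither signed nor unsigned.
bool checkSplit(uint8_t C, uint8_t A, uint8_t B) {
  uint8_t D = splitConstant(C, minTrailingZerosOfSum({A, B}));
  for (unsigned X = 0; X < 256; ++X) {
    for (unsigned Y = 0; Y < 256; ++Y) {
      auto Rest = static_cast<uint8_t>(A * X + B * Y);      // i8 wraps
      unsigned Original = static_cast<uint8_t>(C + Rest);   // zext(i8 sum)
      auto Residual = static_cast<uint8_t>((C - D) + Rest); // i8 wraps
      unsigned Rewritten = D + static_cast<unsigned>(Residual);
      if (Original != Rewritten ||
          Rewritten != static_cast<unsigned>(Residual | D))
        return false;
    }
  }
  return true;
}
```

For the `21 + 12x + 8y` example, `minTrailingZerosOfSum({12, 8})` is 2 and `splitConstant(21, 2)` is 1, matching `(1 + zext(20 + 12x + 8y))<nuw><nsw>`, and `checkSplit(21, 12, 8)` holds for all inputs.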
 ID.AddPointer(Op);
 ID.AddPointer(Ty);
 void *IP = nullptr;
 if (const SCEV *S = UniqueSCEVs.FindNodeOrInsertPos(ID, IP)) return S;
 if (Depth > MaxExtDepth) {
 SCEV *S = new (SCEVAllocator) SCEVZeroExtendExpr(ID.Intern(SCEVAllocator),
 Op, Ty);
 UniqueSCEVs.InsertNode(S, IP);
[...]
 if (auto *SA = dyn_cast<SCEVAddExpr>(Op)) {
 if (SA->hasNoUnsignedWrap()) {
 // If the addition does not unsign overflow then we can, by definition,
 // commute the zero extension with the addition operation.
 SmallVector<const SCEV *, 4> Ops;
 for (const auto *Op : SA->operands())
 Ops.push_back(getZeroExtendExpr(Op, Ty, Depth + 1));
 return getAddExpr(Ops, SCEV::FlagNUW, Depth + 1);
 }

+// zext(C + x + y + ...) --> (zext(D) + zext((C - D) + x + y + ...))<nuw>
+// if D + (C - D + x + y + ...) could be proven to not unsigned wrap
+// where D is the maximum such D that D <= C (unsigned)
+//
+// Useful while proving that address arithmetic expressions are equal or
+// differ by a small constant amount, see LoadStoreVectorizer pass.
+if (const auto *SC = dyn_cast<SCEVConstant>(SA->getOperand(0))) {
+APInt MinValue = getUnsignedRangeMin(SA);
+// Often address arithmetics contain expressions like
+// (zext (add (shl X, C1), C2)), for instance, (zext (5 + (4 * X))).
+// ConstantRange is unable to prove that 1 + (4 + 4 * X) doesn't wrap in
+// such cases:
+//
+// | Expression | ConstantRange          | KnownBits             |
+// |------------|------------------------|-----------------------|
+// | 4 * X      | [L: 0, U: 253)         | XXXX XX00             |
+// |            | => Min: 0, Max: 252    | => Min: 0, Max: 252   |
+// |            |                        |                       |
+// | 4 * X + 4  | [L: 4, U: 1) (wrapped) | YYYY YY00             |
+// |            | => Min: 0, Max: 255    | => Min: 0, Max: 252   |
+//
+// On the other hand, ConstantRange [L: 1, U: 3) degrades to
+// KnownBits 0000 00XX, losing Min: 1, Max: 2 to Min: 0, Max: 3.
+// So use both if available:
rtereshin (author): Another thing to discuss here is the fact that SCEV appears to rely on value range analysis implemented via `ConstantRange` instead of `KnownBits`. It appears to me that we could achieve better results if we used both simultaneously, updating them properly. See the example above justifying that. Do you think it's worth bringing up on the dev mailing list?
rtereshin (author): Now's the adventurous bit: KnownBits is able to prove that `C1 + (C2 * 2^n * X)` doesn't wrap if `C1 < 2^n` precisely because KnownBits operates over the arithmetic base of 2. If KnownBits operated over a base of 3, for example, we could use it to prove that `C1 + (C2 * 3^n * X)` doesn't wrap (for instance, `u * 3 + 1`: indeed, if the digits of `u` are all unknown, KnownBits<base 3> of `(u * 3)` is `XXXX XXX0`, and therefore `u * 3 + C` doesn't wrap for any `C <- {0, 1, 2}`). I suspect it could be proven that there is a basis (in the linear algebra sense) in the system of KnownBits that is sufficient: KnownBits over prime numbers. So let's say for every SCEV expression we cache not just the ConstantRange, but every non-trivial KnownBits<B>, where the base B <- {p1, p2, ..., pK}, p<i> is the i-th prime number, K is some reasonable limit, and "non-trivial" means "not all bits (or rather digits) are unknown", and we use that information to effectively restore <nuw>/<nsw> flags where needed.
rtereshin (author): Ah, I just realized that due to the unfortunate fact that (2^4 - 1) is divisible by 3 and 5, KnownBits over base 3 won't allow us to prove that `3u + 1` doesn't wrap, as it very well may. It will allow us to prove, though, that `3u + 1`, `3u + 2`, and `3u + 3` are consecutive, but that's probably not as useful as if we could start from 0. Same for base 5.
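The wraparound problem with the base-3 idea can be seen by brute force. This is an illustrative C++ sketch, assuming `i8` arithmetic emulated with `unsigned` and a hypothetical helper `wrappingInputs`; it lists every `u` for which `Mul * u + C`, computed in `i8`, overflows. In `i8`, `2^8 - 1 = 255 = 3 * 5 * 17`, so `3 * u` can land exactly on 255 and adding 1 wraps, whereas a multiple of 4 never exceeds 252.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (not an LLVM API): every u in [0, 255] for which the
// i8 expression Mul * u + C wraps, i.e. the truncated product plus C > 255.
std::vector<unsigned> wrappingInputs(unsigned Mul, unsigned C) {
  assert(Mul < 256 && C < 256 && "operands must fit in i8");
  std::vector<unsigned> Result;
  for (unsigned U = 0; U < 256; ++U) {
    unsigned Product = (Mul * U) & 0xFFu; // i8 multiply wraps modulo 256
    if (Product + C > 0xFFu)              // the following add would wrap
      Result.push_back(U);
  }
  return Result;
}
```

`wrappingInputs(3, 1)` returns `{85}` (because `3 * 85 == 255`), while `wrappingInputs(4, 3)` is empty, which is the base-2 property the patch relies on.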
+if (OpV) {
+const DataLayout &DL = getDataLayout();
+KnownBits Known = computeKnownBits(OpV, DL, 0, &AC, nullptr, &DT);
mkazantsev: I don't understand why we need this. `computeKnownBits` is used to deduce ranges of SCEVUnknown. All other SCEV nodes are supposed to propagate range information (e.g. the range of a sum is a range from the sum of mins to the sum of maxes, and so on). Thus, in theory, we should be able to identify the range of any SCEV correctly, unless we have some missing logic in range calculation. What is `OpV` in the example you're trying to improve, and why was SCEV unable to deduce its range via `getUnsignedRange(getSCEV(OpV))`?
rtereshin (author):
> What is OpV in the example you're trying to improve

One of the examples is given in the comment nearby:

    // (zext (add (shl X, C1), C2)), for instance, (zext (5 + (4 * X))).
    // ConstantRange is unable to prove that 1 + (4 + 4 * X) doesn't wrap in
    // such cases:
    //
    // | Expression | ConstantRange          | KnownBits             |
    // |------------|------------------------|-----------------------|
    // | 4 * X      | [L: 0, U: 253)         | XXXX XX00             |
    // |            | => Min: 0, Max: 252    | => Min: 0, Max: 252   |
    // |            |                        |                       |
    // | 4 * X + 4  | [L: 4, U: 1) (wrapped) | YYYY YY00             |
    // |            | => Min: 0, Max: 255    | => Min: 0, Max: 252   |

See also an lldb session running a similar example, which is also present in `test/Analysis/ScalarEvolution/no-wrap-add-exprs.ll` as updated in this patch:

       1814 if (OpV) {
       1815 const DataLayout &DL = getDataLayout();
       1816 KnownBits Known = computeKnownBits(OpV, DL, 0, &AC, nullptr, &DT);
    -> 1817 MinValue = Known.One.ugt(MinValue) ? Known.One : MinValue;
       1818 }
       1819 APInt C = SC->getAPInt();
       1820 APInt D = MinValue.ugt(C) ? C : MinValue;
    Target 0: (opt) stopped.
    (lldb) p OpV->dump()
      %t1 = add i8 %t0, 5
    (lldb) p SA->dump()
    (5 + (4 * %x))
    (lldb) p MinValue
    (llvm::APInt) $1 = {
      U = {
        VAL = 0
        pVal = 0x0000000000000000
      }
      BitWidth = 8
    }
    (lldb) p Known.One.dump()
    APInt(8b, 1u 1s)
    (lldb) p getUnsignedRange(SA)
    (llvm::ConstantRange) $2 = {
      Lower = {
        U = {
          VAL = 5
          pVal = 0x0000000000000005
        }
        BitWidth = 8
      }
      Upper = {
        U = {
          VAL = 2
          pVal = 0x0000000000000002
        }
        BitWidth = 8
      }
    }

> and why SCEV was unable to deduce its range via getUnsignedRange(getSCEV(OpV))?

As you mentioned, the range information is calculated using the "range of a sum is a range from the sum of mins to the sum of maxes, and so on" principle. `ConstantRange` keeps track of min/max boundaries, but it completely loses any periodic information, like "the range contains only values divisible by 4".

`KnownBits` behavior is the exact opposite: it's imprecise when it comes to boundaries, but it keeps track of the periodic information. For instance, the only thing that is known about `4 * x` is that the 2 least significant bits of the value are 0s. From the `ConstantRange` perspective this only means that the value doesn't exceed `2^32 - 4` if treated as an unsigned `i32`; it's completely unaware of the fact that `4 * x` cannot be 7, for instance. If we shift the range by adding 5 (`4 * x + 5`), min/max recomputation of the range leads to the wrapped-around range `[5, 2)`, which gives us no useful information about the minimum and maximum values (the minimum is `0`, the maximum is `2^32 - 1`). From known bits, on the other hand, we know that `4 * x + 5` looks like `XXX...XX01`, therefore the minimum value is `1` and the maximum value is `2^32 - 3`.

Ranges and KnownBits are complementary to each other; neither is more precise than the other in all cases. If we want a value range analysis with good precision, we need to maintain and update both simultaneously.
rtereshin (author): @mkazantsev I think I see the source of the confusion: apparently, the current implementation tries to utilize knownbits-like information in a limited form of "number of trailing zeros", which is computed for `Add` the following way:

    if (const SCEVAddExpr *A = dyn_cast<SCEVAddExpr>(S)) {
      // The result is the min of all operands results.
      uint32_t MinOpRes = GetMinTrailingZeros(A->getOperand(0));
      for (unsigned i = 1, e = A->getNumOperands(); MinOpRes && i != e; ++i)
        MinOpRes = std::min(MinOpRes, GetMinTrailingZeros(A->getOperand(i)));
      return MinOpRes;
    }

https://github.com/llvm-mirror/llvm/blob/650cfa6dc060acb5b4c9571d454ec2b990aad648/lib/Analysis/ScalarEvolution.cpp#L5375-L5381

So it does work for expressions like `4 + 4 * x` (well, sometimes: somehow that kind of information is there for unsigned ranges, but not for signed ranges), and that makes my comment inaccurate. I will change it to the `5 + 4 * x` example. For `5 + 4 * x` it doesn't work, of course, as the number of trailing zeroes is 0 (`5 + 4 * x` ~= `XXX...XX01`).
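The complementary behaviour described in this thread can be reproduced on `i8` with a small C++ sketch. It is a deliberately simplified model, not LLVM's `ConstantRange` or `KnownBits`: the interval side assumes `Mul * x` spans `[0, 256 - 2^TZ(Mul)]` and gives up with `[0, 255]` as soon as adding `C` overflows (roughly what a wrapped range such as `[5, 2)` means for unsigned min/max), and the known-bits side only tracks the low bits of `Mul * x + C` that the trailing zeros of `Mul` fully determine. All names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// Simplified unsigned min/max summary of a set of i8 values.
struct MinMax {
  unsigned Min;
  unsigned Max;
};

// Number of trailing zero bits of an 8-bit value; 8 for zero.
unsigned trailingZerosOf(unsigned V) {
  assert(V < 256 && "value must fit in i8");
  if (V == 0)
    return 8;
  unsigned TZ = 0;
  for (; (V & 1u) == 0; V >>= 1)
    ++TZ;
  return TZ;
}

// Interval model: adding C to [0, ProdMax] either stays inside i8 or
// overflows, and then only the full range [0, 255] is known.
MinMax intervalOfMulAdd(unsigned Mul, unsigned C) {
  unsigned TZ = trailingZerosOf(Mul);
  unsigned ProdMax = TZ >= 8 ? 0u : 256u - (1u << TZ);
  if (ProdMax + C > 0xFFu)
    return {0u, 0xFFu};
  return {C, ProdMax + C};
}

// Known-bits model: the low TZ(Mul) bits of Mul * x are zero, so the same
// low bits of the sum are exactly those of C; all higher bits are unknown.
MinMax knownBitsOfMulAdd(unsigned Mul, unsigned C) {
  unsigned TZ = trailingZerosOf(Mul);
  unsigned Mask = TZ >= 8 ? 0xFFu : (1u << TZ) - 1u;
  unsigned KnownOne = C & Mask;
  unsigned KnownZero = ~C & Mask;
  // Min: unknown bits set to 0; Max: unknown bits set to 1.
  return {KnownOne, ~KnownZero & 0xFFu};
}

// Ground truth: enumerate every x in [0, 255].
MinMax exactOfMulAdd(unsigned Mul, unsigned C) {
  MinMax R{0xFFu, 0u};
  for (unsigned X = 0; X < 256; ++X) {
    unsigned V = (Mul * X + C) & 0xFFu;
    R.Min = std::min(R.Min, V);
    R.Max = std::max(R.Max, V);
  }
  return R;
}
```

For `4 * x + 5`, the interval model gives `[0, 255]`, while the known-bits model and the exhaustive enumeration both give a minimum of 1 and a maximum of 253 (`XXXX XX01`); for `4 * x + 4` the known-bits model gives `[0, 252]`, as in the in-code table.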
+MinValue = Known.One.ugt(MinValue) ? Known.One : MinValue;
+}
+APInt C = SC->getAPInt();
+APInt D = MinValue.ugt(C) ? C : MinValue;
+if (D != 0) {
+const SCEV *SZExtD = getZeroExtendExpr(getConstant(D), Ty, Depth);
+const SCEV *SResidual =
+getAddExpr(getConstant(-D), SA, SCEV::FlagAnyWrap, Depth);
+const SCEV *SZExtR = getZeroExtendExpr(SResidual, Ty, Depth + 1);
+return getAddExpr(SZExtD, SZExtR, SCEV::FlagNUW, Depth + 1);
+}
+}
 }

 if (auto *SM = dyn_cast<SCEVMulExpr>(Op)) {
 // zext((A * B * ...)<nuw>) --> (zext(A) * zext(B) * ...)<nuw>
 if (SM->hasNoUnsignedWrap()) {
 // If the multiply does not unsign overflow then we can, by definition,
 // commute the zero extension with the multiply operation.
 SmallVector<const SCEV *, 4> Ops;
[...]
 if (auto BO = MatchBinaryOp(U, DT)) {
 }
 }
 }

 switch (U->getOpcode()) {
 case Instruction::Trunc:
 return getTruncateExpr(getSCEV(U->getOperand(0)), U->getType());

-case Instruction::ZExt:
-return getZeroExtendExpr(getSCEV(U->getOperand(0)), U->getType());
+case Instruction::ZExt: {
+Value *OpV = U->getOperand(0);
+return getZeroExtendExprForValue(OpV, getSCEV(OpV), U->getType());
+}

 case Instruction::SExt:
 if (auto BO = MatchBinaryOp(U->getOperand(0), DT)) {
 // The NSW flag of a subtract does not always survive the conversion to
 // A + (-1)*B. By pushing sign extension onto its operands we are much
 // more likely to preserve NSW and allow later AddRec optimisations.
 //
 // NOTE: This is effectively duplicating this logic from getSignExtend:

test/Analysis/ScalarEvolution/no-wrap-add-exprs.ll

[...]
 ; CHECK: %q0.zext = zext i8 %q0 to i16
 ; CHECK-NEXT: --> (zext i8 (1 + %len_norange) to i16) U: [0,256) S: [0,256)
 ; CHECK: %q1.zext = zext i8 %q1 to i16
 ; CHECK-NEXT: --> (zext i8 (2 + %len_norange) to i16) U: [0,256) S: [0,256)

 ret void
 }

+@z_addr = external global [16 x i8], align 4
+@z_addr_noalign = external global [16 x i8]
+
+%union = type { [10 x [4 x float]] }
+@tmp_addr = external unnamed_addr global { %union, [2000 x i8] }
+
+define void @f3(i8* %x_addr, i8* %y_addr, i32* %tmp_addr) {
+; CHECK-LABEL: Classifying expressions for: @f3
+entry:
+%x = load i8, i8* %x_addr
+%t0 = mul i8 %x, 4
+%t1 = add i8 %t0, 5
+%t1.zext = zext i8 %t1 to i16
+; CHECK: %t1.zext = zext i8 %t1 to i16
+; CHECK-NEXT: --> (1 + (zext i8 (4 + (4 * %x)) to i16))<nuw><nsw> U: [1,254) S: [1,257)
+
+%q0 = mul i8 %x, 4
+%q1 = add i8 %q0, 7
+%q1.zext = zext i8 %q1 to i16
+; CHECK: %q1.zext = zext i8 %q1 to i16
+; CHECK-NEXT: --> (3 + (zext i8 (4 + (4 * %x)) to i16))<nuw><nsw> U: [3,256) S: [3,259)
+
+%p0 = mul i8 %x, 4
+%p1 = add i8 %p0, 8
+%p1.zext = zext i8 %p1 to i16
+; CHECK: %p1.zext = zext i8 %p1 to i16
+; CHECK-NEXT: --> (zext i8 (8 + (4 * %x)) to i16) U: [0,253) S: [0,256)
mkazantsev: It's weird. Why are the signed and unsigned ranges different?
rtereshin (author): That's a good question. One thing I know is that the issue is orthogonal to this patch and exists on trunk (this is without this patch applied):

    %p1.zext = zext i8 %p1 to i16
    --> (zext i8 (8 + (4 * %x)) to i16) U: [0,253) S: [0,256)

Perhaps the unsigned range takes some knownbits-like information into account, while the signed one doesn't.
rtereshin (author): Maybe this is the spot: https://github.com/llvm-mirror/llvm/blob/650cfa6dc060acb5b4c9571d454ec2b990aad648/lib/Analysis/ScalarEvolution.cpp#L5594-L5613

+%r0 = mul i8 %x, 4
+%r1 = add i8 %r0, 254
+%r1.zext = zext i8 %r1 to i16
+; CHECK: %r1.zext = zext i8 %r1 to i16
+; CHECK-NEXT: --> (2 + (zext i8 (-4 + (4 * %x)) to i16))<nuw><nsw> U: [2,255) S: [2,258)
+
+%y = load i8, i8* %y_addr
+%s0 = mul i8 %x, 32
+%s1 = mul i8 %y, 36
+%s2 = add i8 %s0, %s1
+%s3 = add i8 %s2, 5
+%s3.zext = zext i8 %s3 to i16
+; CHECK: %s3.zext = zext i8 %s3 to i16
+; CHECK-NEXT: --> (1 + (zext i8 (4 + (32 * %x) + (36 * %y)) to i16))<nuw><nsw> U: [1,254) S: [1,257)
+
+%ptr = bitcast [16 x i8]* @z_addr to i8*
+%int0 = ptrtoint i8* %ptr to i32
+%int5 = add i32 %int0, 5
+%int.zext = zext i32 %int5 to i64
+; CHECK: %int.zext = zext i32 %int5 to i64
+; CHECK-NEXT: --> (1 + (zext i32 (4 + %int0) to i64))<nuw><nsw> U: [1,4294967294) S: [1,4294967297)
+
+%ptr_noalign = bitcast [16 x i8]* @z_addr_noalign to i8*
+%int0_na = ptrtoint i8* %ptr_noalign to i32
+%int5_na = add i32 %int0_na, 5
+%int.zext_na = zext i32 %int5_na to i64
+; CHECK: %int.zext_na = zext i32 %int5_na to i64
+; CHECK-NEXT: --> (zext i32 (5 + %int0_na) to i64) U: [0,4294967296) S: [0,4294967296)
+
+%tmp = load i32, i32* %tmp_addr
+%mul = and i32 %tmp, -4
+%add4 = add i32 %mul, 4
+%add4.zext = zext i32 %add4 to i64
+%sunkaddr3 = mul i64 %add4.zext, 4
+%sunkaddr4 = getelementptr inbounds i8, i8* bitcast ({ %union, [2000 x i8] }* @tmp_addr to i8*), i64 %sunkaddr3
+%sunkaddr5 = getelementptr inbounds i8, i8* %sunkaddr4, i64 4096
+%addr4.cast = bitcast i8* %sunkaddr5 to i32*
+%addr4.incr = getelementptr i32, i32* %addr4.cast, i64 1
+; CHECK: %addr4.incr = getelementptr i32, i32* %addr4.cast, i64 1
+; CHECK-NEXT: --> ([[C:4100]] + ([[SIZE:4]] * (zext i32 ([[OFFSET:4]] + ([[STRIDE:4]] * (%tmp /u [[STRIDE]]))) to i64)) + @tmp_addr)
+
+%add5 = add i32 %mul, 5
+%add5.zext = zext i32 %add5 to i64
+%sunkaddr0 = mul i64 %add5.zext, 4
+%sunkaddr1 = getelementptr inbounds i8, i8* bitcast ({ %union, [2000 x i8] }* @tmp_addr to i8*), i64 %sunkaddr0
+%sunkaddr2 = getelementptr inbounds i8, i8* %sunkaddr1, i64 4096
+%addr5.cast = bitcast i8* %sunkaddr2 to i32*
+; CHECK: %addr5.cast = bitcast i8* %sunkaddr2 to i32*
+; CHECK-NEXT: --> ([[C]] + ([[SIZE]] * (zext i32 ([[OFFSET]] + ([[STRIDE]] * (%tmp /u [[STRIDE]]))) to i64)) + @tmp_addr)
+
+ret void
+}
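The `D` chosen by the new code for the `@f3` cases above can be reproduced by hand. This C++ sketch is a simplified model, not the LLVM implementation: for `4 * x + C` in `i8` it stands in for `getUnsignedRangeMin` with a plain interval bound and for `Known.One` with the low two bits of `C`, then applies the patch's `MinValue = umax(range min, Known.One)` and `D = umin(C, MinValue)` and checks the rewrite for every `x`. The helper names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// Stand-in for getUnsignedRangeMin(4 * x + C) in i8: 4 * x lies in [0, 252],
// so the sum's range only stays unwrapped (with minimum C) when C <= 3.
unsigned rangeMinOf4xPlusC(unsigned C) { return 252u + C > 0xFFu ? 0u : C; }

// Stand-in for Known.One of (4 * x + C): the two low bits of 4 * x are zero,
// so the two low bits of the sum are exactly those of C.
unsigned knownOneOf4xPlusC(unsigned C) { return C & 3u; }

// Mirrors the patch: MinValue = umax(range min, Known.One); D = umin(C, MinValue).
unsigned chooseD(unsigned C) {
  assert(C < 256 && "C must fit in i8");
  unsigned MinValue = std::max(rangeMinOf4xPlusC(C), knownOneOf4xPlusC(C));
  return std::min(C, MinValue);
}

// Checks zext(C + 4x) == D + zext((C - D) + 4x) for every i8 value of x.
bool rewriteHolds(unsigned C) {
  unsigned D = chooseD(C);
  for (unsigned X = 0; X < 256; ++X) {
    unsigned Lhs = (4u * X + C) & 0xFFu;             // zext of the i8 sum
    unsigned Rhs = D + ((4u * X + (C - D)) & 0xFFu); // outer add in i16
    if (Lhs != Rhs)
      return false;
  }
  return true;
}
```

`chooseD` yields 1, 3, 0 and 2 for `C` = 5, 7, 8 and 254, matching the `1 + ...`, `3 + ...`, unchanged, and `2 + (zext i8 (-4 + ...))` forms in the CHECK lines for `%t1.zext`, `%q1.zext`, `%p1.zext` and `%r1.zext` (the transform is skipped when `D` is 0).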

test/Transforms/LoadStoreVectorizer/X86/codegenprepare-produced-address-math.ll

This file was added.

				; RUN: opt -codegenprepare -load-store-vectorizer %s -S -o - | FileCheck %s
				; RUN: opt -load-store-vectorizer %s -S -o - | FileCheck %s

				target triple = "x86_64--"

				%union = type { { [4 x [4 x [4 x [16 x float]]]], [4 x [4 x [4 x [16 x float]]]], [10 x [10 x [4 x float]]] } }

				@global_pointer = external unnamed_addr global { %union, [2000 x i8] }, align 4

				; Function Attrs: convergent nounwind
				define void @test(i32 %base) #0 {
				; CHECK-LABEL: @test(
				; CHECK-NOT: load i32
				; CHECK: load <2 x i32>
				; CHECK-NOT: load i32
				entry:
				%mul331 = and i32 %base, -4
				%add350.4 = add i32 4, %mul331
				%idx351.4 = zext i32 %add350.4 to i64
				%arrayidx352.4 = getelementptr inbounds { %union, [2000 x i8] }, { %union, [2000 x i8] }* @global_pointer, i64 0, i32 0, i32 0, i32 1, i64 0, i64 0, i64 0, i64 %idx351.4
				%tmp296.4 = bitcast float* %arrayidx352.4 to i32*
				%add350.5 = add i32 5, %mul331
				%idx351.5 = zext i32 %add350.5 to i64
				%arrayidx352.5 = getelementptr inbounds { %union, [2000 x i8] }, { %union, [2000 x i8] }* @global_pointer, i64 0, i32 0, i32 0, i32 1, i64 0, i64 0, i64 0, i64 %idx351.5
				%tmp296.5 = bitcast float* %arrayidx352.5 to i32*
				%cnd = icmp ult i32 %base, 1000
				br i1 %cnd, label %loads, label %exit

				loads:
				; If and only if the loads are in a different BB from the GEPs codegenprepare
				; would try to turn the GEPs into math, which makes LoadStoreVectorizer's job
				; harder
				%tmp297.4 = load i32, i32* %tmp296.4, align 4, !tbaa !0
				%tmp297.5 = load i32, i32* %tmp296.5, align 4, !tbaa !0
				br label %exit

				exit:
				ret void
				}

				; Function Attrs: convergent nounwind
				define void @test.codegenprepared(i32 %base) #0 {
				; CHECK-LABEL: @test.codegenprepared(
				; CHECK-NOT: load i32
				; CHECK: load <2 x i32>
				; CHECK-NOT: load i32
				entry:
				%mul331 = and i32 %base, -4
				%add350.4 = add i32 4, %mul331
				%idx351.4 = zext i32 %add350.4 to i64
				%add350.5 = add i32 5, %mul331
				%idx351.5 = zext i32 %add350.5 to i64
				%cnd = icmp ult i32 %base, 1000
				br i1 %cnd, label %loads, label %exit

				loads: ; preds = %entry
				%sunkaddr = mul i64 %idx351.4, 4
				%sunkaddr1 = getelementptr inbounds i8, i8* bitcast ({ %union, [2000 x i8] }* @global_pointer to i8*), i64 %sunkaddr
				%sunkaddr2 = getelementptr inbounds i8, i8* %sunkaddr1, i64 4096
				%0 = bitcast i8* %sunkaddr2 to i32*
				%tmp297.4 = load i32, i32* %0, align 4, !tbaa !0
				%sunkaddr3 = mul i64 %idx351.5, 4
				%sunkaddr4 = getelementptr inbounds i8, i8* bitcast ({ %union, [2000 x i8] }* @global_pointer to i8*), i64 %sunkaddr3
				%sunkaddr5 = getelementptr inbounds i8, i8* %sunkaddr4, i64 4096
				%1 = bitcast i8* %sunkaddr5 to i32*
				%tmp297.5 = load i32, i32* %1, align 4, !tbaa !0
				br label %exit

				exit: ; preds = %loads, %entry
				ret void
				}

				attributes #0 = { convergent nounwind }

				!0 = !{!1, !1, i64 0}
				!1 = !{!"float", !2, i64 0}
				!2 = !{!"omnipotent char", !3, i64 0}
				!3 = !{!"Simple C++ TBAA"}
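Both `@test` functions depend on the indices `zext(4 + (base & -4))` and `zext(5 + (base & -4))` being exactly one element apart. The following C++ sketch checks that claim exhaustively on a 16-bit analogue of the `i32` index math (the narrower width is an assumption, chosen only so every `base` can be enumerated) and shows that without the `and` with `-4` the second index can wrap to 0. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Byte distance between the two loads in a 16-bit analogue of @test:
// element indices zext(4 + m) and zext(5 + m), with m = base & -4 when
// MaskLowBits is set (m = base otherwise), each scaled by sizeof(float) == 4.
int64_t byteDistance(uint16_t Base, bool MaskLowBits) {
  uint16_t M = MaskLowBits ? static_cast<uint16_t>(Base & 0xFFFCu) : Base;
  uint32_t Idx4 = static_cast<uint16_t>(M + 4u); // zext of the wrapping add
  uint32_t Idx5 = static_cast<uint16_t>(M + 5u);
  return 4 * (static_cast<int64_t>(Idx5) - static_cast<int64_t>(Idx4));
}

// True if the two accesses are exactly one float apart for every Base.
bool alwaysOneFloatApart(bool MaskLowBits) {
  for (uint32_t B = 0; B <= 0xFFFFu; ++B)
    if (byteDistance(static_cast<uint16_t>(B), MaskLowBits) != 4)
      return false;
  return true;
}
```

With the mask, `4 + m` is a multiple of 4 that is at most `2^16 - 4`, so adding 1 instead of 0 can never wrap, which is the same fact the new `1 + zext(4 + ...)` form exposes to LoadStoreVectorizer.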

Diff 153796
