This is an archive of the discontinued LLVM Phabricator instance.

[SCEV] Add [zs]ext{C,+,x} -> (D + [zs]ext{C-D,+,x})<nuw><nsw> transform
ClosedPublic

Authored by rtereshin on Jul 2 2018, 2:53 PM.

Details

Summary

as well as [zs]ext(C + x + ...) -> (D + [zs]ext(C-D + x + ...))<nuw><nsw>

if the top-level addition in (D + (C-D + x * n)) can be proven not to
wrap, where the choice of D also maximizes the number of trailing
zeroes of (C-D + x * n), ensuring homogeneous behaviour of the
transformation and better canonicalization of such AddRecs

(indeed, there are 2^(2w) different expressions in B1 + ext(B2 + Y) form for
the same Y, but only 2^(2w - k) different expressions in the resulting
B3 + ext((B4 * 2^k) + Y) form, where w is the bit width of the integral type
and k is the number of trailing zero bits guaranteed for Y)

The AddExpr version of the transformation enables better canonicalization
of expressions like

1 + zext(5 + 20 * %x + 24 * %y)  and
    zext(6 + 20 * %x + 24 * %y)

which get both transformed to

2 + zext(4 + 20 * %x + 24 * %y)

This pattern is common in address arithmetic, and the transformation
makes it easier for passes like LoadStoreVectorizer to prove that 2 or
more memory accesses are consecutive and to optimize (vectorize) them.

This change is similar to a number of other changes to Scalar Evolution, namely:

commit 63c52aea76b530d155ec6913d5c3bbe1ecd82ad8
Author: Sanjoy Das <sanjoy@playingwithpointers.com>
Date:   Thu Oct 22 19:57:38 2015 +0000

    [SCEV] Commute zero extends through <nuw> additions

    git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@251052 91177308-0d34-0410-b5e6-96231b3b80d8
commit 3edd5bf90828613bacfdc2ce047d3776363123e5
Author: Justin Lebar <jlebar@google.com>
Date:   Thu Jun 14 17:13:48 2018 +0000

    [SCEV] Simplify zext/trunc idiom that appears when handling bitmasks.

    Summary:
    Specifically, we transform

      zext(2^K * (trunc X to iN)) to iM ->
      2^K * (zext(trunc X to i{N-K}) to iM)<nuw>

    This is helpful because pulling the 2^K out of the zext allows further
    optimizations.

    Reviewers: sanjoy

    Subscribers: hiraditya, llvm-commits, timshen

    Differential Revision: https://reviews.llvm.org/D48158

and the most relevant

commit 45788be6e2603ecfc149f43df1a6d5e04c5734d8
Author: Michael Zolotukhin <mzolotukhin@apple.com>
Date:   Sat May 24 08:09:57 2014 +0000

    Implement sext(C1 + C2*X) --> sext(C1) + sext(C2*X) and
    sext{C1,+,C2} --> sext(C1) + sext{0,+,C2} transformation in Scalar
    Evolution.

    That helps SLP-vectorizer to recognize consecutive loads/stores.

    <rdar://problem/14860614>

    git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@209568 91177308-0d34-0410-b5e6-96231b3b80d8

This patch generalizes the latter one by relaxing the requirements the following way:

  1. C2 doesn't have to be a power of 2; it's enough if it's divisible by 2 a sufficient number of times;
  2. C1 doesn't have to be less than C2; instead of extracting the entire C1, we can split it into 2 terms (00...0XXX + YY...Y000), keep the second one, which may cause wrapping within the extension operator, and move the first one, which doesn't affect wrapping, out of the extension operator, enabling further simplifications (see the sketch after this list);
  3. C1 and C2 don't have to be positive; splitting C1 as shown above produces a sum that is guaranteed not to wrap, signed or unsigned;
  4. in the AddExpr case there can be more than 2 terms, and neither the 2nd and following terms of an AddExpr nor the Step component of an AddRecExpr have to be of the C2*X form or constant (respectively); they just need to have enough trailing zeros, which in turn can be guaranteed by means other than arithmetic, e.g. by pointer alignment;
  5. the extension operator doesn't have to be a sext; the same transformation works and is profitable for zexts as well.
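
A minimal standalone sketch of the constant splitting described in items 2-4, using plain
8-bit C++ arithmetic instead of SCEV/APInt (the function and variable names here are
illustrative only, not the patch's actual helpers):

#include <cstdint>

// Given the constant term C of (C + x + y + ...) and the minimum number of
// trailing zero bits TZ guaranteed for (x + y + ...), split C into
//   D     = the low TZ bits of C   (safe to move out of the extension)
//   C - D = the rest               (stays under the extension operator)
uint8_t splitConstant(uint8_t C, unsigned TZ, uint8_t &Rest) {
  uint8_t Mask = (TZ >= 8) ? 0xFF : (uint8_t)((1u << TZ) - 1);
  uint8_t D = C & Mask; // e.g. C = 21, TZ = 2  ->  D = 1
  Rest = C - D;         //                          C - D = 20
  return D;
}

// zext(C + X) then becomes D + zext((C - D) + X), and the outer addition can
// be marked <nuw><nsw> because it only fills in the low TZ bits of the operand.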

Apparently, optimizations like SLPVectorizer currently fail to
vectorize even rather trivial cases like the following:

double bar(double *a, unsigned n) {
  double x = 0.0;
  double y = 0.0;
  for (unsigned i = 0; i < n; i += 2) {
    x += a[i];
    y += a[i + 1];
  }
  return x * y;
}

If compiled with clang -std=c11 -Wpedantic -Wall -O3 main.c -S -o - -emit-llvm
(!{!"clang version 7.0.0 (trunk 337339) (llvm/trunk 337344)"})

it produces scalar code with the loop not unrolled when n and i are
unsigned (as shown above), but a vectorized and unrolled loop when n and
i are signed (follow https://godbolt.org/g/nq9xF8 to play with it).

With the changes made in this commit the unsigned version will be
vectorized (though not unrolled for unclear reasons).

Diff Detail

Repository
rL LLVM

Event Timeline

rtereshin created this revision.Jul 2 2018, 2:53 PM
rtereshin added inline comments.Jul 2 2018, 3:03 PM
lib/Analysis/ScalarEvolution.cpp
1858

Another thing to discuss here is the fact that SCEV appears to be relying on value range analysis implemented via ConstantRange instead of KnownBits. It appears to me that we could achieve better results if we used both simultaneously, updating them properly. See the example above justifying that.

Do you think it's worth bringing up at the dev mailing list level?

rtereshin added inline comments.Jul 2 2018, 3:49 PM
lib/Analysis/ScalarEvolution.cpp
1858

Now for the adventurous bit: KnownBits is able to prove that C1 + (C2 * 2^n * X) doesn't wrap if C1 < 2^n precisely because KnownBits operates over the arithmetic base of 2. If KnownBits operated over base 3, for example, we could use it to prove that C1 + (C2 * 3^n * X) doesn't wrap (for instance, u * 3 + 1. Indeed, if the base-3 digits of u are all unknown, KnownBits<base 3> of (u * 3) is XXXX XXX0 and therefore u * 3 + C doesn't wrap for any C <- {0, 1, 2}). I suspect that it could be proven that there is a basis (in the linear algebra sense) in the system of KnownBits that is sufficient: KnownBits over prime numbers.

So let's say for every SCEV expression we cache not just a ConstantRange, but every non-trivial KnownBits<B>, where B (the base) <- {p1, p2, ..., pK}, p<i> is the i-th prime number, K is some reasonable limit, and "non-trivial" means "not all bits (or rather digits) are unknown", and we use that information to effectively restore <nuw>/<nsw> flags where needed.

Could you use ScalarEvolution::GetMinTrailingZeros instead of calling computeKnownBits directly?

Could you use ScalarEvolution::GetMinTrailingZeros instead of calling computeKnownBits directly?

I don't think so. I need to know the maximum unsigned number I can safely subtract from an expression (w/o wrapping). Number of trailing zeros is of no use for this, I think.
GetMinTrailingOnes, if it existed, wouldn't help much either: let's say, the expression is x * 8 + 6. The number of trailing ones here is 0, though we can safely subtract any number from 0 to 6 (inclusive) (KnownBits are XXXX X110).

To further entertain the idea, let's notice that the upper limit is not always the constant part of the expression. If it's x * 4 + 6 we can only subtract 0, 1, or 2.
Of course, we could use GetMinTrailingZeros(x * 4) = 2, turn this into a 0000 0011 mask, apply the mask to the constant part (0000 0110), get 10 (2) as the maximum safe subtrahend.

But what if there are more than 2 terms? For instance, the expression is 6 + 20 * %x + 24 * %y. Do I need to rebuild an add from 20 * %x and 24 * %y just to apply GetMinTrailingZeros to the result?

Or I could extract the part of GetMinTrailingZeros that handles add into a separate method so it could be used here. Do you think any of that is a better solution than applying computeKnownBits directly?
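
For concreteness, a quick standalone brute-force check of the subtraction claims above
(plain C++ over i8, not part of the patch):

#include <cassert>
#include <cstdint>

int main() {
  // x * 8 + 6: known bits are XXXX X110, so subtracting any D in [0, 6]
  // never wraps, even though the number of trailing ones is 0.
  for (unsigned x = 0; x < 256; ++x)
    for (unsigned D = 0; D <= 6; ++D)
      assert((uint8_t)(8 * x + 6) >= D); // unsigned subtraction can't wrap

  // x * 4 + 6: only D in [0, 2] is always safe; D = 3 can wrap.
  for (unsigned x = 0; x < 256; ++x)
    for (unsigned D = 0; D <= 2; ++D)
      assert((uint8_t)(4 * x + 6) >= D);
  assert((uint8_t)(4 * 63 + 6) < 3); // 4 * 63 + 6 = 258 -> 2 in i8, so D = 3 wraps
  return 0;
}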

rtereshin added inline comments.Jul 3 2018, 9:41 AM
lib/Analysis/ScalarEvolution.cpp
1858

Ah, I just realized that due to the unfortunate fact that (2^4 - 1) is divisible by 3 and 5, KnownBits over base 3 won't allow us to prove that 3*u + 1 doesn't wrap, as it very well may. It will allow us to prove though that 3*u + 1, 3*u + 2, and 3*u + 3 are consecutive, but it's probably not as useful as if we could start from 0. Same for base 5.
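
A tiny exhaustive check of that wrapping caveat in i4 (plain C++, illustration only):

#include <cstdio>

int main() {
  // In i4 (values mod 16), 3*u + 1 can indeed wrap; print every witness.
  for (unsigned u = 0; u < 16; ++u) {
    unsigned Wide = 3 * u + 1;   // computed without wrapping
    unsigned Narrow = Wide % 16; // computed in i4
    if (Narrow != Wide)
      printf("u = %u: 3*u + 1 = %u wraps to %u in i4\n", u, Wide, Narrow);
  }
  return 0;
}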

mkazantsev added inline comments.Jul 9 2018, 10:23 PM
lib/Analysis/ScalarEvolution.cpp
1861

I don't understand why we need this. computeKnownBits is used to deduce ranges of SCEVUnknown. All other SCEV nodes are supposed to propagate range information (e.g. range of sum is a range from sum of min to sum of max, and so on). Thus, in theory, we should be able to identify the range of any SCEV correctly, unless we have some missing logic in range calculation.

What is OpV in the example you're trying to improve, and why was SCEV unable to deduce its range via getUnsignedRange(getSCEV(OpV))?

mkazantsev added inline comments.Jul 9 2018, 10:25 PM
test/Analysis/ScalarEvolution/no-wrap-add-exprs.ll
150

It's weird. Why are the signed and unsigned ranges different?

rtereshin added inline comments.Jul 9 2018, 11:12 PM
lib/Analysis/ScalarEvolution.cpp
1861

What is OpV in the example you're trying to improve

One of the examples is given in the comment nearby:

// (zext (add (shl X, C1), C2)), for instance, (zext (5 + (4 * X))).
// ConstantRange is unable to prove that 1 + (4 + 4 * X) doesn't wrap in
// such cases:
//
// | Expression |     ConstantRange      |       KnownBits       |
// |------------|------------------------|-----------------------|
// | 4 * X      | [L: 0, U: 253)         | XXXX XX00             |
// |            |   => Min: 0, Max: 252  |   => Min: 0, Max: 252 |
// |            |                        |                       |
// | 4 * X + 4  | [L: 4, U: 1) (wrapped) | YYYY YY00             |
// |            |   => Min: 0, Max: 255  |   => Min: 0, Max: 252 |

see also lldb session running a similar example, also present in test/Analysis/ScalarEvolution/no-wrap-add-exprs.ll updated in this patch:

   1814	      if (OpV) {
   1815	        const DataLayout &DL = getDataLayout();
   1816	        KnownBits Known = computeKnownBits(OpV, DL, 0, &AC, nullptr, &DT);
-> 1817	        MinValue = Known.One.ugt(MinValue) ? Known.One : MinValue;
   1818	      }
   1819	      APInt C = SC->getAPInt();
   1820	      APInt D = MinValue.ugt(C) ? C : MinValue;
Target 0: (opt) stopped.
(lldb) p OpV->dump()
  %t1 = add i8 %t0, 5
(lldb) p SA->dump()
(5 + (4 * %x))
(lldb) p MinValue
(llvm::APInt) $1 = {
  U = {
    VAL = 0
    pVal = 0x0000000000000000
  }
  BitWidth = 8
}
(lldb) p Known.One.dump()
APInt(8b, 1u 1s)
(lldb) p getUnsignedRange(SA)
(llvm::ConstantRange) $2 = {
  Lower = {
    U = {
      VAL = 5
      pVal = 0x0000000000000005
    }
    BitWidth = 8
  }
  Upper = {
    U = {
      VAL = 2
      pVal = 0x0000000000000002
    }
    BitWidth = 8
  }
}

and why was SCEV unable to deduce its range via getUnsignedRange(getSCEV(OpV))?

As you mentioned, the range information is calculated using the

e.g. range of sum is a range from sum of min to sum of max, and so on

principle. ConstantRange keeps track of min/max boundaries, but it completely loses any periodic information, like "the range contains only values divisible by 4". KnownBits' behavior is the exact opposite: it's imprecise when it comes to boundaries, but it keeps track of the periodic information.

For instance, the only thing that is known about 4 * x is that the 2 least significant bits of the value are 0s. From the ConstantRange perspective it only means that the value doesn't exceed 2^32 - 4 if treated as an unsigned i32. It's completely unaware of the fact that 4 * x could not be 7, for instance. If we shift the range by adding 5 (4 * x + 5), min/max recomputation of the range leads to the wrapped-around range [5, 2), which gives us no useful information about the minimum and maximum values (the minimum is 0, the maximum is 2^32 - 1). From known bits, on the other hand, we know that 4 * x + 5 looks like XXX...XX01, therefore the minimum value is 1 and the maximum value is 2^32 - 3.

Ranges and KnownBits are complementary to each other; neither is more precise than the other in all cases. If we want a value range analysis with good precision, we need to maintain and update both simultaneously.
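
A standalone illustration of that complementarity on the i8 example from the table earlier
in this thread (plain C++, not using the actual LLVM ConstantRange/KnownBits classes):

#include <cstdint>
#include <cstdio>

int main() {
  // Exhaustively evaluate 4 * x + 5 over i8 and compare what each analysis
  // style could have concluded about it.
  uint8_t Min = 255, Max = 0;
  bool LowBitsAlways01 = true;
  for (unsigned x = 0; x < 256; ++x) {
    uint8_t V = (uint8_t)(4 * x + 5);
    if (V < Min) Min = V;
    if (V > Max) Max = V;
    LowBitsAlways01 = LowBitsAlways01 && ((V & 3) == 1);
  }
  // ConstantRange-style reasoning: [0, 253) + 5 wraps to [5, 2), which says
  // nothing useful about min/max (0 and 255 respectively).
  // KnownBits-style reasoning: XXXX XX00 + 0000 0101 = XXXX XX01, so
  // min >= 1 and max <= 253, which matches the exhaustive result below.
  printf("min = %u, max = %u, low bits always 01: %d\n",
         (unsigned)Min, (unsigned)Max, (int)LowBitsAlways01);
  // Prints: min = 1, max = 253, low bits always 01: 1
  return 0;
}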

rtereshin added inline comments.Jul 9 2018, 11:42 PM
test/Analysis/ScalarEvolution/no-wrap-add-exprs.ll
150

That's a good question. One thing I know is that the issue is orthogonal to this patch and exists on trunk:

%p1.zext = zext i8 %p1 to i16
-->  (zext i8 (8 + (4 * %x)) to i16) U: [0,253) S: [0,256)

(this is w/o this patch applied)

Perhaps the unsigned range takes some knownbits-like information into account, while the signed one doesn't.

rtereshin added inline comments.Jul 10 2018, 12:11 AM
lib/Analysis/ScalarEvolution.cpp
1861

@mkazantsev I think I see the source of the confusion: apparently, the current implementation tries to utilize knownbits-like information in a limited form of "number of trailing zeros", which is computed for an Add the following way:

if (const SCEVAddExpr *A = dyn_cast<SCEVAddExpr>(S)) {
  // The result is the min of all operands results.
  uint32_t MinOpRes = GetMinTrailingZeros(A->getOperand(0));
  for (unsigned i = 1, e = A->getNumOperands(); MinOpRes && i != e; ++i)
    MinOpRes = std::min(MinOpRes, GetMinTrailingZeros(A->getOperand(i)));
  return MinOpRes;
}

https://github.com/llvm-mirror/llvm/blob/650cfa6dc060acb5b4c9571d454ec2b990aad648/lib/Analysis/ScalarEvolution.cpp#L5375-L5381

So it does work for expressions like 4 + 4 * x (well, sometimes; somehow that kind of information is there for unsigned ranges, but not for signed ranges), and that makes my comment inaccurate. I will change it to the 5 + 4 * x example.

For 5 + 4 * x it doesn't work, of course, as the number of trailing zeroes is 0 (5 + 4 * x ~= XXX...XX01).

rtereshin updated this revision to Diff 154914.Jul 10 2018, 6:22 PM

Updated the comment from a misleading 4 + 4 * x example to a correct 5 + 4 * x one.

I think if we're going to do this, we need to implement it on top of a SCEV-based known-bits implementation; introducing a separate getZeroExtendExprForValue API is going to lead to weird results if SCEV creates a zero-extend expression for some other reason.

Whether we should do this in general, I'm not really sure. I mean, yes, I can see how this particular form is a bit more convenient for the load-store vectorizer, but it doesn't seem very general; it seems more intuitive to canonicalize towards reducing the number of AddExprs. But maybe pulling as much information as possible outside of the zext is generally useful enough to make this worthwhile?

Do you also plan to implement a similar transform for AddRecs? (e.g. (zext i32 {1,+,2}<%while.body> to i64)).

I think if we're going to do this, we need to implement it on top of a SCEV-based known-bits implementation; introducing a separate getZeroExtendExprForValue API is going to lead to weird results if SCEV creates a zero-extend expression for some other reason.

I agree, that would be a more generic and homogeneous solution. Using a (ConstantRange, KnownBits) pair instead of a (ConstantRange, minTrailingZeros) pair (let alone only one component of the latter) across Scalar Evolution consistently may also benefit the framework in a number of other ways. It's a more intrusive change, though. For now I think I could try to go with the number-of-trailing-zeros approach, despite the loss in generality, if you or the community feel strongly against known bits being used the way they are used now in this patch.

Whether we should do this in general, I'm not really sure. I mean, yes, I can see how this particular form is a bit more convenient for the load-store vectorizer, but it doesn't seem very general; it seems more intuitive to canonicalize towards reducing the number of AddExprs. But maybe pulling as much information as possible outside of the zext is generally useful enough to make this worthwhile?

I think this transformation reduces the number of possible operands of a zext, so it brings some of the expressions in C1 + zext(C2 + X) form to the same shape - often C3 + zext(C4 * 2^k + X) - which is canonicalization (if some of the constants are missing, they can be treated as zeroes). There are 2^(2w) different pairs of constants (C1, C2), but only 2^(2w - k) different pairs of (C3, C4 * 2^k), where w is the bit width of the type.

Do you also plan to implement a similar transform for AddRecs? (e.g. (zext i32 {1,+,2}<%while.body> to i64)).

I suppose I should, what would you suggest?

rtereshin retitled this revision from [SCEV] Add zext(C + x + ...) -> D + zext(C-D + x + ...)<nuw> transform to [SCEV] Add [zs]ext{C,+,x} -> (D + [zs]ext{C-D,+,x})<nuw><nsw> transform.
rtereshin edited the summary of this revision. (Show Details)

I've moved away from using KnownBits, extended the proposed transformation to AddRecs, generalized it for signed extensions as well, and unified it all with pre-existing sext-only transformations that handle a strict subset of cases.

There are 2 separate commits planned here: the first non-intrusively adds the zext(C + x + ...) -> (D + zext(C-D + x + ...))<nuw><nsw> transformation only (in its no-KnownBits / no-API-changes version that can be seen in this patch), along with the tests from the initial version of this patch (mostly LoadStoreVectorizer-related), while the second commit brings the rest (as well as adds SLPVectorizer-targeting tests).

Hopefully this is better now.

rtereshin added a subscriber: bogner.

Hi,

Thanks for working on this! From the description the approach looks correct and promising. I glanced through the patch, and it looked good, but if you don't mind I'd like to have another look.

Thanks,
Michael

lib/Analysis/ScalarEvolution.cpp
1566–1568

Could you please also add a description for FullExpr? It might be helpful to add even more examples here and describe the intended use (e.g. ConstantTerm is Start and FullExpr is Step of an AddRec expression).

1585–1588

Just checking my understanding: we're basically finding the largest common denominator here, which is also a power of 2, right?

rtereshin added inline comments.Jul 19 2018, 3:48 PM
lib/Analysis/ScalarEvolution.cpp
1566–1568

Thanks for looking into this!

Could you please also add a description for FullExpr?

Sure, will do.

It might be helpful to add even more examples here and describe the intended use (e.g. ConstantTerm is Start and FullExpr is Step of an AddRec expression).

The next overload is the one that handles AddRec with parameter names being ConstantStart and Step. Do you think the names are self-explanatory, or do I need to elaborate in a comment?

As for the AddExpr version here, I'm going to elaborate on what FullExpr is, and hopefully that will be clear enough.

1585–1588

we're basically finding the largest common denominator here, which is also a power of 2, right?

This

uint32_t TZ = BitWidth;
for (unsigned I = 1, E = FullExpr->getNumOperands(); I < E && TZ; ++I)
  TZ = std::min(TZ, SE.GetMinTrailingZeros(FullExpr->getOperand(I)));

piece effectively does exactly that, yes. Another way to look at it is to say that we have an AddExpr that looks like (C + x + y + ...), where C is a constant and x, y, ... are arbitrary SCEVs, and we're computing the minimum number of trailing zeroes guaranteed for the sum without the constant term: (x + y + ...). If, for example, those terms look as follows:

        i
XXXX...X000
YYYY...YY00
    ...
ZZZZ...0000

then the rightmost non-guaranteed zero bit (a potential one at i-th position above) can change the bits of the sum to the left, but it can not possibly change the bits to the right. So we can compute the number of trailing zeroes by taking a minimum between the numbers of trailing zeroes of the terms.

Now let's say that our original sum with the constant is effectively just C + X, where X = x + y + .... Let's say we've got 2 guaranteed trailing zeros for X:

         j
CCCC...CCCC
XXXX...XX00

Any bit of C to the left of j may in the end cause the C + X sum to wrap, but the rightmost 2 bits of C (at positions j and j - 1) do not affect wrapping in any way. If the upper bits cause a wrap, it will be a wrap regardless of the values of the 2 least significant bits of C. If the upper bits do not cause a wrap, it won't be a wrap regardless of the values of the 2 bits on the right (again).

So let's split C to 2 constants like follows:

0000...00CC  = D
CCCC...CC00  = (C - D)

and the whole sum like D + (C - D + X). The second term of this new sum looks like this:

CCCC...CC00
XXXX...XX00
-----------
YYYY...YY00

The sum above (let's call it Y) may or may not wrap, we don't know, so we need to keep it under a sext/zext. Adding D to that sum though will never wrap, signed or unsigned, if performed on the original bit width or the extended one, because all that final add does is set the 2 least significant bits of Y to the bits of D:

YYYY...YY00 = Y
0000...00CC = D
----------- <nuw><nsw>
YYYY...YYCC

Which means we can safely move that D out of the sext or zext and claim that the top-level sum neither sign wraps nor unsigned wraps.

Let's run an example, let's say we're working in i8s and the original expression (zext's or sext's operand) is 21 + 12x + 8y. So it goes like this:

0001 0101  // 21
XXXX XX00  // 12x
YYYY Y000  // 8y

0001 0101  // 21
ZZZZ ZZ00  // 12x + 8y  // true, alternatively one can say that gcd(12, 8) is guaranteed to have 2 zeroes on the right

0000 0001  // D
0001 0100  // 21 - D = 20
ZZZZ ZZ00  // 12x + 8y

0000 0001  // D
WWWW WW00  // 21 - D + 12x + 8y = 20 + 12x + 8y

therefore zext(21 + 12x + 8y) = (1 + zext(20 + 12x + 8y))<nuw><nsw>
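
A quick exhaustive check of that i8 example (standalone C++, independent of SCEV, just
to confirm the rewrite and the no-wrap claim):

#include <cassert>
#include <cstdint>

int main() {
  for (unsigned x = 0; x < 256; ++x)
    for (unsigned y = 0; y < 256; ++y) {
      uint8_t Inner = (uint8_t)(21 + 12 * x + 8 * y); // 21 + 12x + 8y in i8
      uint8_t Split = (uint8_t)(20 + 12 * x + 8 * y); // (21 - D) + 12x + 8y, D = 1
      uint16_t Lhs = (uint16_t)Inner;                 // zext(21 + 12x + 8y) to i16
      uint16_t Rhs = (uint16_t)1 + (uint16_t)Split;   // 1 + zext(20 + 12x + 8y)
      assert(Lhs == Rhs);                     // the rewritten form is equivalent
      assert(Split % 4 == 0 && Split <= 252); // adding D = 1 can't wrap, even in i8
    }
  return 0;
}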

mzolotukhin added inline comments.Jul 19 2018, 4:05 PM
lib/Analysis/ScalarEvolution.cpp
1566–1568

Thanks! I'm fine with whatever way you choose, I'd just like to see some example/description of what FullExpr should be.

For instance, I spent some time thinking that FullExpr would be, e.g. 3 + 4*x + 6*y (which was inspired by your examples) and I couldn't understand how you can ever get TZ not equal to 0. It all made sense in the end, but saying that ConstantStart is 3 and FullExpr is 4*x + 6*y would've saved me some time :)

1585–1588

Thanks for the great explanation! I think it's worth having it or its shorter version somewhere in comments. And just to be clear: I think that the patch is already very well-commented (thank you for that!), my remarks are just nit-picks.

rtereshin added inline comments.Jul 19 2018, 4:24 PM
lib/Analysis/ScalarEvolution.cpp
1566–1568

Thing is, you were right, SCEVAddExpr *FullExpr is 3 + 4*x + 6*y. It's the iteration (for (unsigned I = 1, ...)) that goes from operand 1 instead of operand 0.

I feel like it would be a little trickier to ask clients of this function to provide a reference (a pair of operand iterators, for instance) to 4*x + 6*y.

1585–1588

You are welcome!

Hm... Do you think it could be better if I just put this as is in the commit message instead? If someone gets curious, they can git blame and see a detailed explanation attributed to the exact version of the code that the explanation describes. This way we don't have a really huge comment in the code that will most certainly get out of sync with the implementation at some point in the future.

mzolotukhin added inline comments.Jul 19 2018, 4:35 PM
lib/Analysis/ScalarEvolution.cpp
1566–1568

Right, we start from 1! Anyways, some note explaining what's going on with FullExpr should help here.

1585–1588

Do you think it could be better if I just put this as is in the commit message instead?

Yeah, that's a good idea.

rtereshin updated this revision to Diff 156411.Jul 19 2018, 5:38 PM

Updated and added comments as requested, renamed FullExpr to WholeAddExpr, and planned to add the following piece to the end of the commit message:

How it all works:

Let's say we have an AddExpr that looks like (C + x + y + ...), where C
is a constant and x, y, ... are arbitrary SCEVs. Let's compute the
minimum number of trailing zeroes guaranteed for that sum w/o the
constant term: (x + y + ...). If, for example, those terms look as
follows:

        i
XXXX...X000
YYYY...YY00
   ...
ZZZZ...0000

then the rightmost non-guaranteed-zero bit (a potential one at i-th
position above) can change the bits of the sum to the left (and at
i-th position itself), but it can not possibly change the bits to the
right. So we can compute the number of trailing zeroes by taking a
minimum between the numbers of trailing zeroes of the terms.

Now let's say that our original sum with the constant is effectively
just C + X, where X = x + y + .... Let's also say that we've got 2
guaranteed trailing zeros for X:

        j
CCCC...CCCC
XXXX...XX00  // this is X = (x + y + ...)

Any bit of C to the left of j may in the end cause the C + X sum to
wrap, but the rightmost 2 bits of C (at positions j and j - 1) do not
affect wrapping in any way. If the upper bits cause a wrap, it will be
a wrap regardless of the values of the 2 least significant bits of C.
If the upper bits do not cause a wrap, it won't be a wrap regardless
of the values of the 2 bits on the right (again).

So let's split C to 2 constants like follows:

0000...00CC  = D
CCCC...CC00  = (C - D)

and represent the whole sum as D + (C - D + X). The second term of
this new sum looks like this:

CCCC...CC00
XXXX...XX00
-----------  // let's add them up
YYYY...YY00

The sum above (let's call it Y) may or may not wrap, we don't know,
so we need to keep it under a sext/zext. Adding D to that sum though
will never wrap, signed or unsigned, if performed on the original bit
width or the extended one, because all that final add does is
set the 2 least significant bits of Y to the bits of D:

YYYY...YY00 = Y
0000...00CC = D
-----------  <nuw><nsw>
YYYY...YYCC

Which means we can safely move that D out of the sext or zext and
claim that the top-level sum neither sign wraps nor unsigned wraps.

Let's run an example, let's say we're working in i8's and the original
expression (zext's or sext's operand) is 21 + 12x + 8y. So it goes
like this:

0001 0101  // 21
XXXX XX00  // 12x
YYYY Y000  // 8y

0001 0101  // 21
ZZZZ ZZ00  // 12x + 8y

0000 0001  // D
0001 0100  // 21 - D = 20
ZZZZ ZZ00  // 12x + 8y

0000 0001  // D
WWWW WW00  // 21 - D + 12x + 8y = 20 + 12x + 8y

therefore zext(21 + 12x + 8y) = (1 + zext(20 + 12x + 8y))<nuw><nsw>
mzolotukhin accepted this revision.Jul 19 2018, 5:54 PM

I've looked at the patch one more time, more carefully, and it looks good to me!

Thanks,
Michael

lib/Analysis/ScalarEvolution.cpp
1570–1579

Maybe move that comment to the explanation in the commit message too? It's not obvious what ConstantRange has to do with the code around it (I understand where it comes from, but I have the full context of the patch now - for someone browsing through the code later it won't be clear why we mention ConstantRange here).

This revision is now accepted and ready to land.Jul 19 2018, 5:54 PM
rtereshin added inline comments.Jul 19 2018, 6:29 PM
lib/Analysis/ScalarEvolution.cpp
1570–1579

Yeah, it has felt out of place for a while now; that's a good suggestion, will do.

Thanks for accepting the patch!

rtereshin updated this revision to Diff 156635.Jul 20 2018, 4:27 PM

Removed an out-of-place comment (the example comparing ConstantRange and KnownBits).

I think if we're going to do this, we need to implement it on top of a SCEV-based known-bits implementation; introducing a separate getZeroExtendExprForValue API is going to lead to weird results if SCEV creates a zero-extend expression for some other reason.

Whether we should do this in general, I'm not really sure. I mean, yes, I can see how this particular form is a bit more convenient for the load-store vectorizer, but it doesn't seem very general; it seems more intuitive to canonicalize towards reducing the number of AddExprs. But maybe pulling as much information as possible outside of the zext is generally useful enough to make this worthwhile?

Do you also plan to implement a similar transform for AddRecs? (e.g. (zext i32 {1,+,2}<%while.body> to i64)).

Hi Eli,

I have moved away from using KnownBits, so no new API anymore, I have also generalized this for AddRec's and sext's, which effectively generalized pre-existing transforms for sext's and added them anew for zext's.

Do you think this is good to go now?

Thanks,
Roman

efriedma accepted this revision.Jul 24 2018, 11:21 AM

Yes, looks fine.

Yes, looks fine.

Thanks!

This revision was automatically updated to reflect the committed changes.