Since the range-based constraint manager (the default) is weak at handling comparisons with symbols on both sides, it is wise to rearrange them so that symbols appear only on the left side. Thus e.g. `A + n >= B + m` becomes `A - B >= m - n`, which enables the constraint manager to store the range `m - n .. MAX_VALUE` for the symbolic expression `A - B`. This can be used later to check whether e.g. `A + k == B + l` can be true: it is likewise rearranged to `A - B == l - k`, so the constraint manager can check whether `l - k` is in the stored range (thus greater than or equal to `m - n`).

# Details

- Reviewers: NoQ, zaks.anna, dcoughlin, george.karpenkov

# Diff Detail

### Event Timeline

Thanks, this looks great!

Because integer promotion rules are tricky, could we, for now, avoid dealing with the situation where the left-hand side, the right-hand side, and the result are not all of the same type? Or maybe we'd like to support subtraction of unsigned values into a signed value of the same size, but still avoid the rest of the cases. It would take an overwhelming amount of testing to ensure that we get all the promotion cases correct.

`lib/StaticAnalyzer/Core/SimpleSValBuilder.cpp`, line 588:

> I'm afraid we may step into

`lib/StaticAnalyzer/Core/SimpleSValBuilder.cpp`, line 588:

> I can remember adding BinaryOperator::isAdditiveOp() at the very beginning, maybe I accidentally deleted it.

Are you sure this works as intended when, e.g., `$a+2==$b*3`, i.e. when on one of the sides the ops are not additive?

I would like to see some test cases for that.

`lib/StaticAnalyzer/Core/SimpleSValBuilder.cpp`, line 577:

> Is overwriting Sym a good idea here? Maybe it will be harder to maintain this code if the same variable can refer to different symbols all the time.

Line 628:

> In LLVM we usually do not add braces for single statement blocks.

I think you might also need to convert the `APSInt`s to an appropriate type, as done above. The types of right-hand-side `APSInt`s do not necessarily coincide with the type of the left-hand-side symbol or of the whole expression, and `APSInt` operations crash when signedness doesn't match (and in a few other cases).

Currently we have broken symbolic casts, which means that we largely ignore the type system, because symbolic expressions involving casts are much harder to simplify or deal with, while overflow cases are too often implicitly assumed away in order to keep the analyzer good at understanding such expressions. This leads to everything having a different type.

`lib/StaticAnalyzer/Core/SimpleSValBuilder.cpp`, lines 402–408:

> An example of how

Lines 526–527:

> An example of how

Could you please help me by constructing an example for this scenario? I tried multiple options, but I always failed, since I check the types of the two sides, as you proposed. So I need an example where `typeof(A+n) == typeof(B+m)` but `typeof(n) != typeof(m)`.

I think we should exclude unsigned types completely. `A >= B` is converted to `A - B >= 0`, thus the range of `A - B` is `[0 .. UINT_MAX]`, which is the full range of the unsigned type. This implies that for any conditional statement `if (A >= B) ...` the range on the `false` branch is empty, so the `else` branch is never executed. The following test fails:

```c
void wrong(int m, int n) {
  if (m >= n)
    return;
  clang_analyzer_warnIfReached(); // expected-warning{{REACHABLE}}
}
```

I think we'd rather delay the unsigned difference feature (as far as i understand, you don't rely on it in the iterator checker), than introduce the artificial cast to signed which may have unexpected consequences, because that's not how unsigned difference normally behaves. I'd think a bit more on this as well.

Hmm, you are right, there are only problems with casts of symbols, not of `APSInt`s; type of the latter is always equal to the type of the whole expression. So when two `SymIntExpr`s have the same type `T`, types of their right-hand sides are automatically equal to `T` as well. Please add this as a comment then :)

The rest looks good to me now.

Rearrangement of unsigned comparison removed (FIXME added). Comment regarding the types of integer constants added.

I have some concerns about soundness when the rearrangement may overflow.

Here is an example:

```c
void clang_analyzer_eval(int);

void foo(signed char A, signed char B) {
  if (A + 0 >= B + 0) {
    clang_analyzer_eval(A - 126 == B + 3); // This yields FALSE with this patch
  }
}
```

But for A = 126 and B = -3 the `clang_analyzer_eval` is reachable and should evaluate to true.

Artem, Anna, and I discussed this a bit in person. We think that even though the benefits look great, it can't be generally applied. Maybe we could apply it in cases where other constraints guarantee that overflow cannot occur?

It still seems like we are inferring invariants that are not sound. Do we need to restrict the symbolic transformation so that it only applies when A - B, too, is known to not overflow? Is that sufficient? Is it often the case that the analyzer knows these restrictions?

For example, in the following we infer that `B - A` is in `[127, 127]` at program point (1), but this invariant is not true when `A` is `-100` and `B` is `40`:

```c
void bar(signed char A, signed char B) {
  if (A + 127 < B + 1) {
    // (1)
  }
}
```

Similarly, we infer that `A - B` is in `[0, 127]` at program point (2), but this invariant is not true when `A` is `100` and `B` is `-40`:

```c
void foo(signed char A, signed char B) {
  if (A + 0 >= B + 0) {
    // (2)
  }
}
```

You can see these invariants by running with the `debug.ViewExplodedGraph` checker enabled.

What do you suggest? Should we widen the type of the difference, or abandon this patch and revert back to the local solution I originally used in the iterator checker?

> What do you suggest? Should we widen the type of the difference, or abandon this patch and revert back to the local solution I originally used in the iterator checker?

Does the local solution you used in the iterator checker not have the same problem?

`lib/StaticAnalyzer/Core/SimpleSValBuilder.cpp`, line 566:

> some thoughts - evalBinOpNN is already pretty gigantic (300+ locs),

It's something similar to assuming that the string length is within the range [0, INT_MAX/4] in CStringChecker: we can easily assume that no overflow is happening in computations involving string lengths or iterator positions, but not on generic integers. Which led me to believe that we could maintain a no-overflow variant of evalBinOp (questionable).

Would anything go wrong if we only enabled this code when both symbols are known to be within the range [-max/4, max/4]? And in the checker, add the respective assumption. I believe it's a very clear way to express that no overflow is happening. In fact, in the program state we could add an API `ProgramStateRef assumeNoOverflow(SVal, QualType)`, which tries to assume that the value is within the range [-max/4, max/4] for signed types or [0, max/4] for unsigned types (and fails when such an assumption is known to be violated), so as to avoid duplicating this trick in every checker.

If both sides have concrete integers, they must be in the range [min/4 .. max/4]; if only one side does, it must be in the range [min/2 .. max/2].

It seems that Artem's suggestion is not enough (or I misunderstood it). So two options remain: either we drop this and revert to the local solution in the Iterator Checkers or we extend the type when rearranging the comparison. Which way to go?

I tried to extend the type to avoid the overflow scenarios. Unfortunately, this breaks essential calculations that rely on overflow (e.g. `ProgramState::assumeInbound()`). So I see no other option than to abandon this patch and return to the local solution in the iterator checkers.

I think this is the final attempt. If the symbols are different, the type is extended, so we store a correct (but extended) range. However, if the symbols are the same, we do not change the type, so checks like `assumeInbound()` are not affected.

Anna, Devin, Artem, please review whether this version is functionally correct. (Style is another matter.) If it is not, then we should go back to the local solution in the iterator checkers. However, we must then continue the review there; it has been standing still for half a year.

`lib/StaticAnalyzer/Core/SimpleSValBuilder.cpp`, line 595:

> Do you have an exact suggestion here?

Line 596:

> Width of

Any progress reviewing this? The iterator checkers have been pending for more than half a year because of this.

An alternative solution is to always extend the type and change `ProgramState::assumeInbound()` so it does not move everything to the bottom of the range, but maybe to the middle.

I had a quick look into actually implementing this solution, and i generally liked how it looks. Instead of the checker saying "I know that these symbols won't overflow on any operations, so let me do whatever I want", it says "I know that these symbols are small" and the core handles the rest.

I managed to get D35110 and D32642 running on top of my solution in a fairly straightforward manner. I did not check whether there is any impact on performance; if there is, in the worst case we may want to expose the "unsafe" functions to the checkers to optimize out repeated checks. A bit of cleanup may be necessary after me (e.g. turning static functions into methods where appropriate).

Unsigned integers are not supported, because their overflows are harder to avoid; however, as far as i understand you are only interested in signed integers. For signed integers, rearrangement only kicks in when all symbols and concrete integers are of the same type `T` and are constrained within `[-max/4, max/4]`, where `max` is the maximum value of `T`.

Here are my patches:

All tests should pass when all 6 patches are applied.

Could you see if this works for you?

Thank you for your work! Probably I made some mistake, because my MAX/4 solution did not work when I tried it earlier. Your solution seems to work. For the iterator checkers this is surely enough. Should I upload my patch rebased on top of yours?

A breakthrough with credit going to Devin: Note that whenever we're not dealing with `>`/`<`/`<=`/`>=` (but only with additive ops and `==` and `!=`, and we have everything of the same type) we can rearrange regardless of constraints, simply because Z/nZ is an abelian group.

First of all, it means that we definitely do not need to check constraints in those cases.

Then the question is: would it be enough to limit rearrangements to additive ops and equality comparisons? `operator<()` is rarely redefined for iterator objects. You rely on `>`/`<` comparisons to compare symbolic iterators to `.begin()` or `.end()`, but how much more does it give in practice, compared to straightforwardly matching, say, `i == i.container.end + N` where `N` is non-negative, given that the rest of our solver barely solves anything? Like, let's say that "iterator `i` is past end" from now on literally means "*there exists a non-negative N for which i is-the-same-symbol-as i.container.end + N*" rather than "*i >= i.container.end*". Would the checker lose much from such a change? If not, we can completely screw all checks and gain a lot more value from the rearrangements.

Thank you for your comment!

Unfortunately, the iterator checkers depend very much on the `>`/`>=`/`<`/`<=` operators as well. We check for `i < C.begin()` and `i >= C.end()`; I cannot imagine this being solved just with `==`/`!=`. How do we find the `N` if we only use `==` or `!=`? Furthermore, later parts also add invalidation checks, where we must find iterator positions that are in some relation with the invalidated position. For example, insertion into a vector invalidates every position `i >= cur`.

> How to find the `N` if we only use `==` or `!=`?

Hence the difference between `==` and `is-the-same-symbol-as`. We can find `N` by looking at the symbol.

We'd lose track of cases where, say, `i` and `.end()` were compared to each other for `>`/`<` before. The question is about how common such cases are.

Sorry if I misunderstand something, but if I do not know the `N` and have to look it up, then I have to create the difference manually (as I did originally) in order to look it up.

> We'd lose track of cases where, say, `i` and `.end()` were compared to each other for `>`/`<` before. The question is about how common such cases are.

`i` and `.begin()` are compared to each other for `<`. And (as I mentioned) there will be lots of `<`, `<=`, `>` and `>=` operators when invalidating.

Does this mean it might be efficient enough that we want to implement this regardless what will be the final solution for the iterator checkers? Or we need a strong use case to accept such a change into core?

The unconstrained rearrangements for `+`/`-`/`==`/`!=` are definitely good to go regardless of anything else.

Within the checker, we propose to manually simplify both `$x - N < $x` and `$x + N > $x` to true (where N is positive), because in general this is unsound (due to potential overflows), but for iterator position symbols this is **the** intended behavior (if `container.end() + N` overflows, it's even worse than a regular past-end access).

For the rest of the cases (`>`/`<` with mixed symbols), we propose to delay the decision for now. For now simply admit that the solver is not awesome. If we had a perfect solver, we wouldn't have to rearrange anything, so the root cause of the problem is in the solver, and we shouldn't be fixing it on the checker side. If, before turning the checker on by default, we prove through evaluation that this shortcoming of the solver is indeed a bigger problem for this checker than for the rest of the analyzer, and is also the biggest problem with the checker, we can still come up with something (eg. the already-existing max/4 approach). But let's delay this until we see how bad this part of the problem actually is.

Thank you for your response! However, I think you (not you personally, Artem, but you three at Apple) do not really understand that I need to compare `A+m` to `B+n` not only for iterator range checking, but also in later parts. So your proposal means that I am not allowed to fix this shortcoming of the constraint solver either in the engine or in the checker. Maybe part 2 of the iterator checker patches will work, but then we will restart the whole dispute at a later part.

As I mentioned, I need these comparisons in a lot of places, not only in range checking. This is the core part of my solution. It is primarily needed for invalidation, which is the most solid part. The latest version (not part 10, but a fixed version of the whole checker set) finds only 5 false positives for invalidated iterator access, which I could further reduce by adding more detailed simulation of insertion and erasure.

So the options are still to fix it in the checker or to fix it in the engine (the max/4 or the type extension solution); leaving it unfixed is not an option. I am open to any solution, but only to full solutions, not partial ones, because those will not help us enough.

Just to be clear, what is the source of the problem here? As far as I understand, your concern is the lack of support for `$x + N > $x + M`. Does this affect the mismatched iterator part of the checker? What is the effect on the other parts? Not finding some issues? Having more false positives?

If there are parts of the checker that do not depend on evaluating such constraints, maybe we could proceed with upstreaming those, so the whole process is not blocked while we are still figuring out what to do with the rest. What do you think?

Let us summarize the possibilities:

- Type extension approach. I tested it: all tests pass, and in Devin's examples we do not infer a narrower range than we should. (A wider one does not matter.) I think this is the most complete solution, and many checkers could benefit from it. There are also two possibilities here:

a) If the symbols on both sides are equal, we do not extend the type, because of `assumeInbound()` in `ProgramState`, which first adds `MIN` to both the bound and the index. So e.g. `S-1<S` becomes `S+MIN-1<S+MIN`. However, `S+MIN-1 == S+MAX`, so if we extend the type here, we get `0 < MIN-MAX`, which is `0 < 2*MIN+1`, which is `false`. So we leave the type unchanged; then `MIN-MAX` wraps around to `1`, so `0 < MIN-MAX` becomes `0 < 1`, which is `true`.

b) We extend the type even in the case where the symbols on both sides are equal, but we fix `assumeInbound()` in `ProgramState` to add `(MAX+MIN)/2` instead of `MIN`. (Why is this normalization necessary?)

- [MIN/4..MAX/4] approach. We lose the (probably rare) cases using large numbers, but for most checkers this is probably no problem.

- We drop this patch and do the workaround in the iterator checker. Although it adds some complexity, and other checkers cannot benefit from the rearrangement, it solves the problem of the iterator checkers for the moment.

The rearrangement only for `==` and `!=` is not enough for the iterator checkers. Maybe it is enough for checking iterator ranges, but it does not remove the complexity from that check: to handle the cases where we are completely outside the valid range, we still have to manually separate the symbol and the concrete integer and do the comparison manually. For invalidating iterators after operations on the container it does not help at all; there we have to deal with `<`, `<=`, `>` and `>=` as well.

The reason why i don't want to commit the MAX/4 approach now (for the `>`/`<` case) is that it has too little useful effect until the iterator checker is enabled by default. However, it is a core change that immediately affects all users with all its negative effects (such as performance and code complexity). When i say that (1) this approach has little useful effect and (2) this approach may cause performance issues, both (1) and (2) are only based on intuition (mine or Devin's). If somebody investigates the impact of the MAX/4 change and shows that both concerns are in fact wrong (the approach is indeed very useful for all modeling and/or has negligible performance impact), i think it should land. Otherwise, i think it shouldn't land now, but should be delayed until the iterator checker is ready to be enabled by default.

For the type extension approach, somebody simply needs to do the math and prove that it works soundly in all cases. Devin has been heroically coming up with counter-examples so far, but even if he doesn't find any, it doesn't mean that it works, so ideally we shouldn't make him do this. The thing about the MAX/4 approach is that the math is fairly obvious: it is clear that overflows or truncations never occur under these constraints, therefore school algebra rules apply. If the type extension approach is proven to be sound, and covers more cases than the MAX/4 approach, i'd gladly welcome it.

> If the type extension approach is proven to be sound

I lack the full context here, but in my experience Z3 is really great for proving whether certain operations may or may not overflow, using the built-in bitvector type.

(I'm not sure though if that is what is being done here).

I am sure that it covers more cases. My proposal is the following:

- I will try to do the math. However, that is not enough, because some checkers (and checker utility functions exported into the core) are dependent on the overflow. So

- I will also try the modification on some real-world code with most of the checkers enabled and compare the results. This takes a lot of time, so until then

- I suggest to continue the review on the iterator checker patches where I put back the workaround, but I separate them from the rest of the code and mark them with a FIXME referring to this patch. Furthermore,

- I remove the dependency from the "Constraint Manager Negates Difference" patch (D35110), so the workaround in the checker can be somewhat simplified if that one is accepted. It is not a real dependency, the negation can be used also in the case we create the symbol difference manually.

What is your opinion about this approach?

I'm totally fine with assuming the MAX/4 constraint on checker side - extension math would still be better than the MAX/4 pattern-matching in core because extension math should be more useful on its own. Otherwise, i'm also fine with MAX/4 `>`/`<` under an off-by-default flag, if this is indeed blocking future reviews. I'm just really uncomfortable with huge manual checker-side symbolic computations, essentially declaring that checkers should make their own solvers. Analyzer progress is already slowed down dramatically by technical debt in a lot of places - we rarely have time to address it, but at least we shouldn't introduce much more. This doesn't, in my opinion, reduce the fascination of that whole thing you've constructed. It is wonderful.

> I suggest to continue the review

I just tried to unblock your progress on the other iterator checker reviews a little bit. I should have done this earlier. My overall impression is that they would be going much easier than the first ones, because they are split in really small neat chunks.

The problem with the type extension approach is that it is mathematically "more correct than it should be": it fixes all overflow cases, even intentional ones, which we probably do not want. I am not speaking about intentional overflow cases in the checkers or in the core infrastructure, which could be fixed, but about intentional overflow cases in the code we analyze.

So probably the MAX/4 approach is the correct one. But how to add a flag for this? Is it a flag enabled by the user or is it automatically enabled if the checker is enabled?

I guess it'd be an `-analyzer-config` flag. You can add it to the `AnalyzerOptions` object, which has access to these flags and can be accessed from `AnalysisManager`.

OK, I can do that. But how should I call it? The name should be something the user also understands, not something referring to internal stuff. Any ideas?

I guess a lot of these options already refer to internal stuff. Similarly to off-by-default checkers, such an option would not be for users to decide whether they want it or not; it is a flag to hide an experimental feature from users completely, while still being able to develop it incrementally, until it's ready to be enabled by default.

`-analyzer-config aggressive-relational-comparison-simplification=true` sounds good to me.