This is an archive of the discontinued LLVM Phabricator instance.

[Analyzer] Find constraints that are directly attached to a BinOp
AbandonedPublic

Authored by martong on May 18 2021, 9:34 AM.

Details

Summary

Currently, if we have a constraint on a binary operator expression and then we add
another constraint to any of the binop's sub-expressions, then the first
constraint is not considered by the engine:

int foo() {
  if (x + y != 0)
    return 0;
  clang_analyzer_eval(x + y == 0); // expected-warning{{TRUE}} OK.
  if (y != 0)
    return 0;
  clang_analyzer_eval(y == 0);     // expected-warning{{TRUE}} OK.
  clang_analyzer_eval(x + y == 0); // expected-warning{{TRUE}} FAILS!
}

Interestingly, if we first constrain y and then (x + y), the
behaviour is what we'd normally expect:

int foo() {
  if (y != 0)
    return 0;
  if (x + y != 0)
    return 0;
  clang_analyzer_eval(x + y == 0); // expected-warning{{TRUE}} OK.
  clang_analyzer_eval(y == 0);     // expected-warning{{TRUE}} OK.
}

The discrepancy stems from the underlying implementation of the engine:
In the latter case, the Environment of the current ExplodedNode has a
(x + y) -> [0, 0] binding. However, in the first case we have a [y] -> 0
binding in the current ExplodedNode, and when we build the SVal for the
(x + y) expression we simply build up the [x + 0] SVal, i.e. we never
try to find the constraint for (x + y).

In this patch, I add a StmtVisitor that builds up an SVal consisting
only of LValue SVals. This SVal can then be used to query the constraint
manager, so we can find the desired constraint.

An alternative implementation could have been to traverse back in the
ExplodedGraph to the node where we can find the original binding of the
x + y expression. This seemed far from ideal, as we may not know how far
back we should traverse.

Diff Detail

Event Timeline

martong created this revision.May 18 2021, 9:34 AM
martong requested review of this revision.May 18 2021, 9:34 AM
Herald added a project: Restricted Project.May 18 2021, 9:34 AM
Herald added a subscriber: cfe-commits.
NoQ added a comment.May 18 2021, 9:53 AM

Ok so the state has enough constraints. It's enough to know that $x + $y == 0 and $y == 0 in order to deduce that $x + 0 == 0. But you're saying that we don't know how to infer it so instead of making us able to infer it let's make us ask the solver easier questions. That's a terrible hack but at this point i think we might as well take it. @vsavchenko WDYT, what would it take to fix the solver instead?

Ok so the state has enough constraints. It's enough to know that $x + $y == 0 and $y == 0 in order to deduce that $x + 0 == 0. But you're saying that we don't know how to infer it so instead of making us able to infer it let's make us ask the solver easier questions. That's a terrible hack but at this point i think we might as well take it. @vsavchenko WDYT, what would it take to fix the solver instead?

Actually, the problem is more subtle than that. The solver is already capable of solving such equations; the test below passes on the main branch:

int test(int x, int y) {
  if (y != 0)
    return 0;
  if (x + y != 0)
    return 0;
  clang_analyzer_eval(x + y == 0); // expected-warning{{TRUE}}
  clang_analyzer_eval(y == 0);     // expected-warning{{TRUE}}
  clang_analyzer_eval(x == 0);     // expected-warning{{TRUE}}
  return 0;
}

However, we can generalize the original problem by using 3 variables. In this case no solver is capable of solving it, because we simply don't have enough equations:

int test(int x, int y, int z) {
  if (x + y + z != 0)
    return 0;
  if (z != 0)
    return 0;
  clang_analyzer_eval(x + y + z == 0);  // UNKNOWN !
  clang_analyzer_eval(z == 0);          // OK.
  return 0;
}

The crux of the problem is that we don't ask the right question of the solver, because we are literally unable to ask the right question (as of main). The solver takes an SVal (more precisely, a symbolic expression) and can give back the related ranges or a concrete value.
We can see that the constraints are there, and they are correct:

"constraints": [
   { "symbol": "reg_$1<int z>", "range": "{ [0, 0] }" },
   { "symbol": "(reg_$0<int x>) + (reg_$1<int y>) + (reg_$1<int z>)", "range": "{ [0, 0] }" }
 ],

However, we never try to query "(reg_$0<int x>) + (reg_$1<int y>) + (reg_$1<int z>)"; instead we ask the solver separately about "(reg_$0<int x>)", "(reg_$0<int y>)" and "(reg_$0<int z>)". This is because of how we bind the SVals to the expressions and because of the "constant folding"-like mechanism encoded in getKnownValue and in simplifySVal. We had previously bound the ConcreteInt 0 to the ImplicitCastExpr of 'z', and this fuels the "constant folding" mechanism (which is very useful in most cases but causes this logical error).

martong edited the summary of this revision. (Show Details)May 18 2021, 12:25 PM
NoQ added a comment.May 18 2021, 5:20 PM

Ok so the state has enough constraints. It's enough to know that $x + $y == 0 and $y == 0 in order to deduce that $x + 0 == 0. But you're saying that we don't know how to infer it so instead of making us able to infer it let's make us ask the solver easier questions. That's a terrible hack but at this point i think we might as well take it. @vsavchenko WDYT, what would it take to fix the solver instead?

Actually, the problem is more subtle than that. The solver is already capable of solving such equations; the test below passes on the main branch:

int test(int x, int y) {
  if (y != 0)
    return 0;
  if (x + y != 0)
    return 0;
  clang_analyzer_eval(x + y == 0); // expected-warning{{TRUE}}
  clang_analyzer_eval(y == 0);     // expected-warning{{TRUE}}
  clang_analyzer_eval(x == 0);     // expected-warning{{TRUE}}
  return 0;
}

In this case the equations are $y == 0 and $x + 0 == 0 which is much easier to solve. This happens for the same reason that your patch is needed in the first place: we're eagerly substituting a constant.

manas added a subscriber: manas.May 18 2021, 5:35 PM

In this case the equations are $y == 0 and $x + 0 == 0 which is much easier to solve.

Yes, you are right.

This happens for the same reason that your patch is needed in the first place: we're eagerly substituting a constant.

Absolutely, that's the point. On the other hand, it is very important to emphasize that we cannot solve this problem with a stronger solver; see my example with 3 variables and two equations above.

Also, another interesting question (raised by @steakhal) is whether we can/could handle associativity somehow. Commutativity does not seem to be handled by the constraint manager (alas, we'd search for "y + x" when we query "x + y").

My take on this: for whatever approach, we will definitely be able to construct a more nested/complex example that it cannot handle.
For this patch, I'm wondering about something like this:

int foo() {
  if (z != 0)
    return 0;
  if (x + y + z != 0)
    return 0;
  clang_analyzer_eval(x + y + z == 0); // expected-warning{{TRUE}} OK.
  if (y != 0)
    return 0;
  clang_analyzer_eval(y == 0);         // expected-warning{{TRUE}} OK.
  clang_analyzer_eval(z == 0);         // expected-warning{{TRUE}} OK.
  clang_analyzer_eval(x + y + z == 0); // expected-warning{{TRUE}} ?
}

As for the solver, it is something that has tormented me for a long time. Is there a way to avoid a full-blown brute-force check of all existing constraints and get some knowledge about symbolic expressions from the constraints these symbolic expressions are actually part of (right now we can reason about an expression only if we know some information about its own parts, aka its operands)?

NoQ added a comment.EditedMay 19 2021, 1:00 PM

This happens for the same reason that your patch is needed in the first place: we're eagerly substituting a constant.

Absolutely, that's the point. On the other hand, it is very important to emphasize that we cannot solve this problem with a stronger solver, see my example with 3 variables and two equations above.

Well, in that other example, we should be asking about $x + $y + 0 == 0 at some point. And then the solver should be able to deduce it from $x + $y + $z == 0 and $z == 0. If we're not asking this question, that's obviously the first problem to get out of our way.

martong updated this revision to Diff 346966.May 21 2021, 2:45 AM
  • Add test case for commutativity
  • Add test case for Valeriy's case

My take on this: for whatever approach, we will definitely be able to construct a more nested/complex example that it cannot handle.
For this patch, I'm wondering about something like this:

int foo() {
  if (z != 0)
    return 0;
  if (x + y + z != 0)
    return 0;
  clang_analyzer_eval(x + y + z == 0); // expected-warning{{TRUE}} OK.
  if (y != 0)
    return 0;
  clang_analyzer_eval(y == 0);         // expected-warning{{TRUE}} OK.
  clang_analyzer_eval(z == 0);         // expected-warning{{TRUE}} OK.
  clang_analyzer_eval(x + y + z == 0); // expected-warning{{TRUE}} ?
}

I've added a test case for this, and it passes with the patch.

Also, I added another test case checking that "y + x" should be zero if "x + y" is constrained. This fails. To support that, we would have to synthesize all combinations of the commutative expressions, which of course would not scale. (Perhaps we could do this only for the top-level expression; that way the most common case could be supported.)

This happens for the same reason that your patch is needed in the first place: we're eagerly substituting a constant.

Absolutely, that's the point. On the other hand, it is very important to emphasize that we cannot solve this problem with a stronger solver, see my example with 3 variables and two equations above.

Well, in that other example, we should be asking about $x + $y + 0 == 0 at some point. And then the solver should be able to deduce it from $x + $y + $z == 0 and $z == 0. If we're not asking this question, that's obviously the first problem to get out of our way.

Yeah, you are right, and my analysis was not precise enough. I can confirm that currently - at main - we do query $x + $y + 0 == 0, and the solver is incapable of solving it from the given constraints. So, yes, this patch is a "workaround" to ask the solver what it knows about the value of the verbatim expression. This has its limitations (see the test case for commutativity), but I still think it could make the analyzer more precise.

As for the solver, it is something that has tormented me for a long time. Is there a way to avoid a full-blown brute-force check of all existing constraints and get some knowledge about symbolic expressions from the constraints these symbolic expressions are actually part of (right now we can reason about an expression only if we know some information about its own parts, aka its operands)?

Well, it is a hard question.
I've been thinking about building a "parent" map for the sub-expressions, like we do in the AST (see clang::ParentMap). We could use this parent map to inject new constraints during the "constant folding" mechanism.
So, let's say we have $x + $y = 0, and then when we process $y = 0 we'd add a new constraint: $x = 0. We could add this new constraint because the parent map tells us that $y is connected to $x + $y, so we know we have to visit that expression.
What do you think, could it work?

As for the solver, it is something that has tormented me for a long time. Is there a way to avoid a full-blown brute-force check of all existing constraints and get some knowledge about symbolic expressions from the constraints these symbolic expressions are actually part of (right now we can reason about an expression only if we know some information about its own parts, aka its operands)?

Well, it is a hard question.
I've been thinking about building a "parent" map for the sub-expressions, like we do in the AST (see clang::ParentMap). We could use this parent map to inject new constraints during the "constant folding" mechanism.
So, let's say we have $x + $y = 0, and then when we process $y = 0 we'd add a new constraint: $x = 0. We could add this new constraint because the parent map tells us that $y is connected to $x + $y, so we know we have to visit that expression.
What do you think, could it work?

Yes, this was exactly my line of thinking. Instead of trying to use hard logic every time we are asked a question, we can try to simplify existing constraints based on the new bit of information.
The main problem with this is the tree nature of symbolic expressions. If we have something trivial like $x + $y - sure. But something with a lot more nested subexpressions can get tricky really fast. That can be solved if we have a canonical form for the trees (this will resolve the problem with $x + $y versus $y + $x as well). Right now, we have bits of this all around our codebase, but those are more like workarounds as opposed to one general approach.

As for the solver, it is something that has tormented me for a long time. Is there a way to avoid a full-blown brute-force check of all existing constraints and get some knowledge about symbolic expressions from the constraints these symbolic expressions are actually part of (right now we can reason about an expression only if we know some information about its own parts, aka its operands)?

Well, it is a hard question.
I've been thinking about building a "parent" map for the sub-expressions, like we do in the AST (see clang::ParentMap). We could use this parent map to inject new constraints during the "constant folding" mechanism.
So, let's say we have $x + $y = 0, and then when we process $y = 0 we'd add a new constraint: $x = 0. We could add this new constraint because the parent map tells us that $y is connected to $x + $y, so we know we have to visit that expression.
What do you think, could it work?

Yes, this was exactly my line of thinking. Instead of trying to use hard logic every time we are asked a question, we can try to simplify existing constraints based on the new bit of information.
The main problem with this is the tree nature of symbolic expressions. If we have something trivial like $x + $y - sure. But something with a lot more nested subexpressions can get tricky really fast. That can be solved if we have a canonical form for the trees (this will resolve the problem with $x + $y versus $y + $x as well). Right now, we have bits of this all around our codebase, but those are more like workarounds as opposed to one general approach.

OK, I have started working on a constraint-manager-based prototype implementation with a parent map and a hook in setConstraint. The test file in this patch (even the commutativity test) passes with it, so it looks promising. I hope I can share it next week. Also, I'd like to do some performance measurements, both on this patch and on the new solver-based one.

I think it would also be beneficial to document the semantics. Whenever we say $x + $y, it's not entirely clear whether we mean the addition operation in a mathematical sense or we follow C/C++ language semantics. Right now it's mostly mixed within the analyzer. It would be really nice to see, for example, how and when wrapping is considered, and what happens if the types of the operands don't match in bitwidth or signedness. This is sort of the reason why ArrayBoundV2 has false positives in some scenarios.

As for the solver, it is something that has tormented me for a long time. Is there a way to avoid a full-blown brute-force check of all existing constraints and get some knowledge about symbolic expressions from the constraints these symbolic expressions are actually part of (right now we can reason about an expression only if we know some information about its own parts, aka its operands)?

Well, it is a hard question.
I've been thinking about building a "parent" map for the sub-expressions, like we do in the AST (see clang::ParentMap). We could use this parent map to inject new constraints during the "constant folding" mechanism.
So, let's say we have $x + $y = 0, and then when we process $y = 0 we'd add a new constraint: $x = 0. We could add this new constraint because the parent map tells us that $y is connected to $x + $y, so we know we have to visit that expression.
What do you think, could it work?

Yes, this was exactly my line of thinking. Instead of trying to use hard logic every time we are asked a question, we can try to simplify existing constraints based on the new bit of information.
The main problem with this is the tree nature of symbolic expressions. If we have something trivial like $x + $y - sure. But something with a lot more nested subexpressions can get tricky really fast. That can be solved if we have a canonical form for the trees (this will resolve the problem with $x + $y versus $y + $x as well). Right now, we have bits of this all around our codebase, but those are more like workarounds as opposed to one general approach.

OK, I have started working on a constraint-manager-based prototype implementation with a parent map and a hook in setConstraint. The test file in this patch (even the commutativity test) passes with it, so it looks promising. I hope I can share it next week. Also, I'd like to do some performance measurements, both on this patch and on the new solver-based one.

A parent map is more or less trivial for situations like $x + $y, where we simply relate $x and $y. But what should we do in the more general case, when $x and $y are expressions themselves? One simple example here is $a + $b + $c as originally constrained and $a + $c that got constrained later on. Should we include $a, $b, $c, $a + $b, $a + $c, and $b + $c in the parent map to make it work?

And again, even with a parent map, we still need canonicalization.

Alright, I see your point. I agree that solving the problem of "$a + $b + $c
constrained and then $a + $c got constrained" requires canonicalization.
However, canonicalization seems to be non-trivial to implement. And there are
some other, easier cases that I think we could (should) solve before starting to
implement canonicalization.

The parent map is just an optimization. The functionality should work even if
we use brute force to go through all existing constraints.

Because of the above, I suggest the following steps. These steps could be
implemented as individual patches that depend on each other, and once all of
them are accepted we could merge them in one lock-step.

  1. Update setConstraint to simplify existing constraints (and add the simplified constraint) when a new constraint is added. In this step we'd simply iterate over all existing constraints and try to simplify them with simplifySVal. This would solve the simplest cases where we have two symbols in the tree. E.g. these cases would pass:

int test_rhs_further_constrained(int x, int y) {
  if (x + y != 0)
    return 0;
  if (y != 0)
    return 0;
  clang_analyzer_eval(x + y == 0); // expected-warning{{TRUE}}
  clang_analyzer_eval(y == 0);     // expected-warning{{TRUE}}
  return 0;
}

int test_lhs_and_rhs_further_constrained(int x, int y) {
  if (x % y != 1)
    return 0;
  if (x != 1)
    return 0;
  if (y != 2)
    return 0;
  clang_analyzer_eval(x % y == 1); // expected-warning{{TRUE}}
  clang_analyzer_eval(y == 2);     // expected-warning{{TRUE}}
  return 0;
}
  2. Add the capability to simplify more complex constraints, where there are 3 symbols in the tree. In this change I suggest the extension and overhaul of simplifySVal. This change would make the following cases pass (notice the 3rd check in each test-case).

int test_left_tree_constrained(int x, int y, int z) {
  if (x + y + z != 0)
    return 0;
  if (x + y != 0)
    return 0;
  clang_analyzer_eval(x + y + z == 0); // expected-warning{{TRUE}}
  clang_analyzer_eval(x + y == 0);     // expected-warning{{TRUE}}
  clang_analyzer_eval(z == 0);         // expected-warning{{TRUE}}
  x = y = z = 1;
  return 0;
}

int test_right_tree_constrained(int x, int y, int z) {
  if (x + (y + z) != 0)
    return 0;
  if (y + z != 0)
    return 0;
  clang_analyzer_eval(x + y + z == 0); // expected-warning{{TRUE}}
  clang_analyzer_eval(y + z == 0);     // expected-warning{{TRUE}}
  clang_analyzer_eval(x == 0);         // expected-warning{{TRUE}}
  return 0;
}
  3. Add canonical trees to the solver. The way we should do this is to build "flat" sub-trees of associative operands. E.g. b + a + c becomes +(a, b, c), i.e. a tree with a root and 3 children [1]. The ordering of the children could be done by their addresses (Stmt *), and we could simply store them in an array. Probably we'll need a new mapping: SymbolRef -> CanonicalTree.

When a new constraint is added, we build up its canonical tree (if it does not
exist yet). Then, we check each existing constraint's canonical tree to see
whether the newly constrained expression is a sub-expression of it. (The
check could be a simple recursive descending search in the children arrays of
the nodes.) If it is a sub-expression, then we simplify it with a customized
canonical tree visitor that calls evalBinOp appropriately.

In this step we should handle only + and *.

This would pass the below tests:

int test_left_tree_constrained_commutative(int x, int y, int z) {
  if (x + y + z != 0)
    return 0;
  if (y + x != 0)
    return 0;
  clang_analyzer_eval(x + y + z == 0); // expected-warning{{TRUE}}
  clang_analyzer_eval(x + y == 0);     // expected-warning{{TRUE}}
  clang_analyzer_eval(z == 0);         // expected-warning{{TRUE}}
  return 0;
}

int test_associative(int x, int y, int z) {
  if (x + y + z != 0)
    return 0;
  if (x + z != 0)
    return 0;
  clang_analyzer_printState();
  clang_analyzer_eval(x + y + z == 0); // expected-warning{{TRUE}}
  clang_analyzer_eval(x + z == 0);     // expected-warning{{TRUE}}
  clang_analyzer_eval(y == 0);         // expected-warning{{TRUE}}
  return 0;
}

  4. Extend 3) with - and /: lhs - rhs becomes lhs + (-rhs) and lhs / rhs is substituted with lhs * (1/rhs).

  5. There are still some expressions that the solver will not be able to reason about (e.g. <<), but it may happen that we store a constraint for them. In these cases it is a good idea to search for the expression in the constraint map, but without constant folding. This functionality might be covered by step 2). If not, then https://reviews.llvm.org/D102696 could be a good alternative.

  6. Do performance measurements and perhaps enhance with parent maps.

I am planning to upload a patch for steps 1) and 2) very soon.

[1] Hans Vangheluwe et al, An algorithm to implement a canonical representation
of algebraic expressions and equations in AToM3

Oh, wow. Great plan! I'd like to participate. 😊
I have a few comments:

Alright, I see your point. I agree that solving the problem of "$a + $b + $c
constrained and then $a + $c got constrained" requires canonicalization.

That was actually not an example for canonicalization, but rather of how the ParentMap will exponentially explode due to the number of combinations of sub-expressions if we want to make it useful for all cases.
If we do want to handle situations like this, the ParentMap should be rethought.

  1. Update setConstraint to simplify existing constraints (and add the simplified constraint) when a new constraint is added. In this step we'd simply iterate over all existing constraints and try to simplify them with simplifySVal

I like this one! So, to be more specific: if we add a constraint Sym -> C, where C is a constant, we can try to simplify other constrained symbols where Sym is a sub-expression.
I want to add that now (because of me 😅) we do not store constraints as a map Sym -> RangeSet; it's a bit trickier.
We have a few mappings: Sym -> EquivalenceClass (member-to-class relationship), EquivalenceClass -> RangeSet, EquivalenceClass -> {Sym} (class-to-members relationship), and EquivalenceClass -> {EquivalenceClass} (to track disequalities).
It means that we will need to replace Sym with its simplified version in Sym -> EquivalenceClass and EquivalenceClass -> {Sym}.
Not that it is impossible or anything, but it makes it a bit harder as there are more moving pieces.

  2. Add the capability to simplify more complex constraints, where there are 3 symbols in the tree. In this change I suggest the extension and overhaul of simplifySVal. This change would make the following cases pass (notice the 3rd check in each test-case).

This sounds great, but we do need to think about performance, so we should discuss how we plan to implement such an extension.

  3. Add canonical trees to the solver. The way we should do this is to build "flat" sub-trees of associative operands. E.g. b + a + c becomes +(a, b, c), i.e. a tree with a root and 3 children [1]. The ordering of the children could be done by their addresses (Stmt *), and we could simply store them in an array. Probably we'll need a new mapping: SymbolRef -> CanonicalTree.

I think that flattening is a good idea (and it is widespread in other solvers): a "flat" sub-tree is a non-binary (n-ary) tree.
However, it is useful ONLY if we can efficiently support an isSubsetOf predicate, specifically for cases like +(a, b, c) and +(a, c).

I don't think that Sym -> Canonical is a good idea. The main goal of canonicalization is to compact a potentially exponential number of equivalent ways of spelling the same symbolic expression into a single representation.
So, it means that we expect Canonicalization(Sym) to be called for multiple Syms producing identical results. This way Sym -> Canonical is simply a cache for Canonicalization (i.e. maybe it shouldn't even be associated with a state).

Then, we check each existing constraints' canonical tree to check
whether the newly constrained expression is a sub-expression of that.

I'm worried about performance implications of this.

  4. Extend 3) with - and /: lhs - rhs becomes lhs + (-rhs) and lhs / rhs is substituted with lhs * (1/rhs).

Is that true for bit-vectors, though?

I just noticed that I simply forgot to click submit yesterday 😩