Add the capability to simplify more complex constraints where there are 3 symbols in the tree. In this change I extend simplifySVal to query constraints of children sub-symbols in a symbol tree. (The constraint for the parent is asked in getKnownValue.)
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Unit Tests
Event Timeline
Looking great, thanks!
I have a couple of notes below.
clang/lib/StaticAnalyzer/Core/SimpleSValBuilder.cpp | ||
---|---|---|
1150–1151 | Do we actually produce them? | |
1174–1175 | It seems like a very common pattern now: SVal Const = getConst(Expr); if (Const.isUndef()) return Visit(Expr); return Const; At least we can make a function out of this, so we don't repeat this pattern over and over. | |
clang/test/Analysis/simplify-complex-constraints.cpp | ||
25–27 ↗ | (On Diff #348511) | Since it's not really specific to addition, maybe we can have tests for other operators? |
clang/lib/StaticAnalyzer/Core/SimpleSValBuilder.cpp | ||
---|---|---|
1150–1151 | You can met SymbolCast here if evalCast produced it. if (!Loc::isLocType(CastTy)) if (!IsUnknownOriginalType || !CastTy->isFloatingType() || T->isFloatingType()) return makeNonLoc(SE, T, CastTy); Also my patch will produce a lot of integral cast symbols D103096. You are welcome to examine it. |
clang/lib/StaticAnalyzer/Core/SimpleSValBuilder.cpp | ||
---|---|---|
1150–1151 | In my comment I meant IntSymExpr |
clang/lib/StaticAnalyzer/Core/SimpleSValBuilder.cpp | ||
---|---|---|
1150–1151 | Oh, anyway :-) |
clang/lib/StaticAnalyzer/Core/SimpleSValBuilder.cpp | ||
---|---|---|
1150–1151 | I think we should remove ConcreteInt and and as a consequence IntSymExpr, SymIntExpr in the nearest future. SymbolVal should be reworked to store a QualType, a RangeSet and an expression pointer itself. IMO SymSymExpr and SymbolVal is sufficient to make any calculations. It may help generalize a lot of them. ConcreteInt is just a special case for SymbolVal. More than that PointerToMember can be removed as well, because we would get an info being a pointer to member from QualType in this way. |
Returning to the discussion raised in D102696, I'd like to share my vision.
I think we can use much easier approach to use valid constraints at any point of time.
The main idea is lazy-reasoning of the ranges.
This approach:
- doesn't need to use canonicalization;
- doesn't need to update all the constraints on each setConstraint call (using brute-force or any other structure traversals);
- add an additional bindings for every simple operand on each assignment or initialization.
- remove an invalid bindings for every simple operand on each assignment or initialization.
- recursively get the range when and only when needed (lazy-reasoning).
Example:
a = b + 1; c = a - 5; if (b != 42) return 0; if(c == 38) return 1; return 0;
a = b + 1;
- LHS a lookup. LHS not found.
- Add binding $a = $b + 1
- Traverse through RHS b + 1.
- b lookup. b not found.
- Built a new expression a - 1 for b.
- Add binding for b: $b = $a - 1
- 1 is a constant. Skip.
c = a - 5;
- LHS c lookup. LHS not found.
- Add binding $c = $a - 5
- Traverse through RHS a - 5.
- a lookup. a found.
- If LHS not found and a found, then skip.
- 5 is a constant. Skip.
if (b != 42) return;
- Get the range of $b. Direct range not found. Add $b to the list of visited symbols.
- Substitute bindings. From $b != 42 to $a - 1 == 42.
- Traverse through $a - 1.
- Get the range of $a. Direct range not found. Add $a to the list of visited symbols.
- 1 is a constant. Skip.
- Substitute bindings. From $a - 1 == 42 to $b + 1 - 1 == 42.
- Traverse through $b + 1 - 1.
- $b is in the list of already visited symbols. No way to find a range.
- Create a range from the type [MIN, MAX].
- Perform comparison and produce a constraint for b [42, 42].
if(c == 38) return;
- Get the range of $c. Direct range not found. Add $c to the list of visited symbols.
- Substitute bindings. From $c == 38 to $a - 5 == 38.
- Traverse through $a - 5.
- Get the range of $a. Direct range not found. Add $a to the list of visited symbols.
- 5 is a constant. Skip.
- Substitute bindings. From $a - 5 == 38 to $b + 1 - 5 == 38.
- Traverse through $b + 1 - 5.
- Get the range of $b. Direct range found [42, 42].
- 1 is a constant. Skip.
- Calculate(simplificate the expression etc.) a new range for $c [38, 38].
- Condition is true.
return 1
P.S. I've just started thinking of the way to solve this problem. For now it's just a sketch of where I'm going. I think this can even replace EquivalenceClass.
Great to know that you are also thinking about this problem!
The more the merrier!
I have a couple of notes about your suggestion.
First is minor, a = b + 1 and c = a + 5 won't actually produce any constraints. Starting from those statements if you request a symbol for a you will get b + 1 and b + 6 for c.
Of course, you can change your code so that we make an assumption that a == b + 1 and c == a + 5 and those would be constraints. That's why I said that it's minor.
Now, for the biggest concern. You cover in a very detailed manner all the traversals and caches for visited symbols, but have Substitute bindings as a sub-step , which is HUGE and absolutely non-trivial.
In your examples, you have a linear chain of related symbolic expressions, i.e. something like this x -> y + 2, y -> z + 3, z -> 4 so that we have x -> 9 after three substitutions. But the problem it is generally not one-to-one relationship, so x -> y1 + 1, x -> y2 + 2, ... , x -> yN + N. Imagine that each yK is involved with another M constraints. So, you would need to check x -> y1 -> z11, if it's not working - check x -> y1 -> z12 and so on. When you run out of options for z1K, you need to backtrack and try x -> y2 -> z21, etc. As you can see, this is BRUTE FORCE.
And it couldn't be easier theoretically. So, instead of trying that many options every time we need to judge about, we take a bit simpler shortcut. It's far from being complete, but honestly we're not pursuing completeness. For my example above, whenever we find out that zKM -> 5, we simplify everything that involves zKM and get yK -> 10 and get x -> 15. It is far from being perfect, I don't like that we need to go through all symbols and try to simplify every single symbol, and it would be good to optimize this scenario (mapping symbols to all symbolic expressions that contain them, for example), but essentially it is cheaper in the long run.
In my approach it can't be more then one binding for one symbol. Like:
x = y + z; produces $x = $y + $z, $y = $x - $z, $z = $x - $y.
Then if z = a;, all above should be invalidated (removed), because we can't rely on those expressions any more. And new ones should be added $z = $a and $a = $z.
As you can see, this is BRUTE FORCE.
I would say this is a linked list with stop-markers, not blindly passing through all the bindings. And I don't talk about implementation details. For now we just need to elaborate an approach.
Hmm, Okay, but what about situations if you have: a = a1 + a2 and a = a3 + a4 + a5 are you going to throw away one of these constraints? And if so, how do you want to select which one?
Are you talking about comparison or assignment? Both assignments can't be valid at the same time, and latter replaces bindings of former. In case of comparisons, they both can be valid.
But we should keep in mind that assignment is a write operation which replaces and invalidates previous bindings, and comparison is a read operation. It can add new bindings but can not remove old ones.
This is what I haven't dig deep enough yet. Let's do this together how we can handle that.
I replied to you earlier that assignments are not producing constraints. The analyzer has some sort of SSA (not really, but anyways), so every time we reassign the variable it actually becomes a different symbol. So, from this perspective DeclRefExpr x and symbol x are two different things:
int x = foo(); // after this DeclRefExpr and VarDecl x are associated with conjured symbol #1 (aka conj#1) int c = x + 10; // c is associated with `$conj#1 + 10` x = a + 1; // DeclRefExpr x is now associated with `$a + 1` x = c + x; // and now it is `conj#1 + 11`
There is no symbolic assignment in the static analyzer. Symbols are values, values don't change. We can only obtain more information of what particular symbolic value represents.
I hope it makes it more clear.
As for equality, we already track equivalence classes of symbols and they can have more than 2 symbols in them. Actually, this whole data structure mostly makes sense for cases when the class has more than 2 elements in it.
! In D103317#2793797, @vsavchenko wrote:
I replied to you earlier that assignments are not producing constraints. The analyzer has some sort of SSA (not really, but anyways), so every time we reassign the variable it actually becomes a different symbol. So, from this perspective DeclRefExpr x and symbol x are two different things:
int x = foo(); // after this DeclRefExpr and VarDecl x are associated with conjured symbol #1 (aka conj#1) int c = x + 10; // c is associated with `$conj#1 + 10` x = a + 1; // DeclRefExpr x is now associated with `$a + 1` x = c + x; // and now it is `conj#1 + 11`There is no symbolic assignment in the static analyzer. Symbols are values, values don't change. We can only obtain more information of what particular symbolic value represents.
I hope it makes it more clear.
As for equality, we already track equivalence classes of symbols and they can have more than 2 symbols in them. Actually, this whole data structure mostly makes sense for cases when the class has more than 2 elements in it.
Thanks for the clarifications.
I'm trying to say that we need to keep our attention on assignments/initializations as well if we want to solve this problem efficiently. I didn't say that assignments produce constraints. Of course they do not :-). I want we produce more bindings(associations) on assignments to use them as mapping for further substitutions and traversing to get the right range.
As for the equivalence classes. I think we should go away from special cases. I mean what about lower-then classes or greater-or-equal classes. I think you got what I mean. I'm trying to find a general approach which can cover almost all individual improvements around CSA codebase(about ranges).
I started working on this when saw your discussion. Just what you also consider the direction I'm moving. A lot of corner-examples, like you gave me, are welcome. Thank you for the feedbacks!
Thank you for your comments Denys! It's great to see that some else is also bothered by these problems, and the more approaches we have the deeper the overall understanding is.
Assignments and initializations don't affect anything, we shouldn't care about them at all because they don't affect symbolic values in any way. You keep bringing up assignments/initializations when I repeatedly tell you that these are orthogonal to symbolic values, they simply associate certain symbolic values/expressions with memory regions (aka bind them).
I want to make sure you understand that there is no silver bullet here. There is no one magic method that will resolve everything in polynomial let alone linear time (unless someone proves that NP = P). This problem is inherently exponential. We try to avoid it at all costs (and, thus, suck sometimes).
In general, when SMT-solvers need to solve a set of bit-vector equations, they use a technique called bit blasting - just turning each n-bit integer into n-separate boolean variables, and using SAT solver after that. More advanced solvers use special theories before to reduce dimensionality of the problem, one of them is equality theory. These theories are not limitations or "special cases", they are pieces of fundamental knowledge about the domain. Those are as generic as you can go. One symbolic expression can be equal to multiple symbolic expressions at the same time, it is the fact, not a special case.
I'm sorry for being too harsh here, but this problem IS harder than you think. If you want to approach it correctly, and not waste your time, you should get these fundamentals straight.
I appologize for my immaturity and for the audacity of my infantile assumptions. I admit any constructive criticism. Thank you for referencing the theory. I will thoroughly study the domain in all.
As for the assignments and why I brought them up, that's because I just tried to look at the problem from another angle. Consider two snippets:
void f(int a, int b, int c, int x, int y){ a = b; x = y; c = a; // What can you say about a, x, c? // `a` equals to `b` and 'c' // `x` equals to `y` // `c` equals to `a` and `b` }
and
void f(int a, int b, int c, int x, int y){ if(a == b) { if(x == y) { if(c == a) { // What can you say about a, x, c? // `a` equals to `b` and 'c' // `x` equals to `y` // `c` equals to `a` and `b` } } } }
They both generates the same relationship between the variables. Assignment seemed to me as just a special case in a true branch of == operator and can be substituted with it in calculations. So, I decided to make some assumptions in this way. Digging more, it opened a lot of gaps and lacks for me.
Next time I will investigate deeper to present a finalized solution and not wasting your time explaining me the basics.
No need to apologize! I do like your enthusiasm, great things come from it. But as I said, I don't want you to waste your own time pursuing things that won't work.
As for the assignments and why I brought them up, that's because I just tried to look at the problem from another angle. Consider two snippets:
OK, let me annotate the code, so we can hopefully close this topic once and for all.
void f(int a, int b, int c, int x, int y){
At this point we have 5 memory regions a, b, c, x and y. They represent the memory cells allocated for the corresponding parameters on stack. When we call function, these actually have some values, but we don't know them due to the limitations of the analysis. For this reason, we say that they contain symbolic values. So, let's simply enumerate that so that there is no misunderstanding. Values: #1, #2, #3, #4, and #5. These are not real numbers, we enumerate some unknown values. As I said, we introduced them as initial values for parameters, so our store (mapping from memory to values) looks likes this: a -> #1, b -> #2, c -> #3, x -> #4, y -> #5. We don't know anything about these values at this point.
a = b;
What do we know at this point? Do we get more information about the value that was stored in a before? No, we overwrote this value entirely, and what changed here is the mapping: a -> #2, b -> #2, c -> #3, x -> #4, y -> #5 because b contained value #2. Symbolic value #1 is gone, it's dead. No-one will ever see or use it again.
x = y;
Using the same logic: a -> #2, b -> #2, c -> #3, x -> #5, y -> #5.
c = a;
And again: a -> #2, b -> #2, c -> #2, x -> #5, y -> #5.
// What can you say about a, x, c? // `a` equals to `b` and 'c' // `x` equals to `y` // `c` equals to `a` and `b` }
Yes, they are, but we don't need constraints for that. We know which values are associated with each memory region and we can compare them. That's all, this simple. That's why assignments and initializations DON'T MATTER for the solver, these are simply STORE of some symbolic value into memory. Solver cares only about symbolic values, memory is irrelevant.
and
void f(int a, int b, int c, int x, int y){
As before: a -> #1, b -> #2, c -> #3, x -> #4, y -> #5.
if(a == b) {
Now, the value that we imagine to be in memory region a (namely #1) is NOT GONE! It is still there, it is being used. And what we know after this if, is that on one of the paths after it #1 == #2 and on the other path #1 != #2. These are just values that we know nothing about, but we started to gain some knowledge.
So, the store is the same (the values stored in those memory cell are the same as they were before): a -> #1, b -> #2, c -> #3, x -> #4, y -> #5, but now we add constraint #1 == #2.
if(x == y) {
The same here, store is the same, but constraints are now: #1 == #2 and #4 == #5.
if(c == a) {
#1 == #2 == #3 and #4 == #5
// What can you say about a, x, c? // `a` equals to `b` and 'c' // `x` equals to `y` // `c` equals to `a` and `b` } } } }
We can actually figure out that #2 == #3 because of equality theory. We only knew #1 == #2 and that #3 == #1, but equality is a transitive binary relation and we reflect that notion in our equivalence classes data-structure.
They both generates the same relationship between the variables. Assignment seemed to me as just a special case in a true branch of == operator and can be substituted with it in calculations. So, I decided to make some assumptions in this way. Digging more, it opened a lot of gaps and lacks for me.
Relationship between variables are the same, but neither solver nor the analyzer reason about things in terms of variables. It's memory and values that the memory carries. In the first example, we have only 2 values stored in 5 memory locations. In the second example, it's 5 values stored in 5 memory locations. On some of the paths we happen to know how those values are related (which is not true for other paths), that's constraints.
I hope now it's clear.
Next time I will investigate deeper to present a finalized solution and not wasting your time explaining me the basics.
We probably should not have discussions like these because they are not directly about the patch on review. If you want to continue this discussion, let's bring it to [cfe-dev].
- Rebase on top of parent revisions
- Address review comments
clang/lib/StaticAnalyzer/Core/SimpleSValBuilder.cpp | ||
---|---|---|
1150–1151 |
Yes we produce IntSymExprs. | |
1174–1175 | Good idea! | |
clang/test/Analysis/simplify-complex-constraints.cpp | ||
25–27 ↗ | (On Diff #348511) | Ok, I've added a new test, and modified a bit this test. |
To me at least, the patch looks good.
Please post some comparative measurements to demonstrate it won't introduce runtime regression.
clang/lib/StaticAnalyzer/Core/SimpleSValBuilder.cpp | ||
---|---|---|
1144 | Let's be explicit about it. | |
1148–1151 | ||
clang/test/Analysis/svalbuilder-simplify-compound-svals.cpp | ||
24 | It feels odd that in some of your examples you return some value, but in the rest, you don't. | |
33 | You could additionally assert that y == 0 and z == 0. | |
51 | What if z were in the middle? Would it still pass? | |
66 | Would the test pass if you were using z + y != 1 here? |
clang/lib/StaticAnalyzer/Core/SimpleSValBuilder.cpp | ||
---|---|---|
1144 | Good point. | |
1148–1151 | Makes sense. | |
clang/test/Analysis/svalbuilder-simplify-compound-svals.cpp | ||
24 | yep, I agree, updated. | |
33 | y * z == 0 does not imply that both y and z are 0. However, we could expect that one of them is 0, but currently we don't have such a deduction (this might be done in ConstraintAssignor in an independent patch). | |
51 | No, that would not pass. This is because we do not handle commutativity, for that we should have a canonical form of the symbol trees (check out point 3 here https://reviews.llvm.org/D102696#2784624). | |
66 | No, because we do not handle commutativity. |
clang/test/Analysis/svalbuilder-simplify-compound-svals.cpp | ||
---|---|---|
33 |
Actually, it is not a good idea to bifurcate there (and architecturally impossible ATM), so we should implement that differently. |
Please repeat the measurement for openssl. There must have been some interference in the memory consumption. Aside from that the results look great.
clang/lib/StaticAnalyzer/Core/SimpleSValBuilder.cpp | ||
---|---|---|
1105 | Where did you address this FIXME? |
clang/lib/StaticAnalyzer/Core/SimpleSValBuilder.cpp | ||
---|---|---|
1105 | I didn't, but this FIXME became obsolete at some point in the past. The reasons:
|
clang/lib/StaticAnalyzer/Core/SimpleSValBuilder.cpp | ||
---|---|---|
1105 | Thanks. |
I've added a new test case that triggers an assertion. That problem is getting fixed in a new parent patch, which I am about to add to the stack very soon.
Please repeat the measurement for openssl. There must have been some interference in the memory consumption. Aside from that the results look great.
I did. Average memory usage is unaffected.
It's good enough for me, thanks.
clang/test/Analysis/svalbuilder-simplify-compound-svals.cpp | ||
---|---|---|
75 | I guess, this statement should not be reachable. |
clang/test/Analysis/svalbuilder-simplify-compound-svals.cpp | ||
---|---|---|
75 | Actually, L72 simplifies to 1 != b, L73 simplifies to the same, L74 is b ^ 2 == 1. And this system has a solution: b == -1 satisfies both L72 and L74 conditions. Note that, however, the ConstraintAssignor is not smart enough to deduce that b == -1. I have a WIP branch to accomplish that here. |
Still looks great.
clang/test/Analysis/svalbuilder-simplify-compound-svals.cpp | ||
---|---|---|
75 | Uh, you are right. Thanks. |
Where did you address this FIXME?