Download Raw Diff

Details

Reviewers

vsavchenko
NoQ
steakhal
Szelethus

Commits

rG0a17896fe6fd: [Analyzer][Core] Make SValBuilder to better simplify svals with 3 symbols in…

Summary

Add the capability to simplify more complex constraints where there are 3
symbols in the tree. In this change I extend simplifySVal to query constraints
of children sub-symbols in a symbol tree. (The constraint for the parent is
asked in getKnownValue.)

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

martong created this revision.May 28 2021, 6:40 AM

Herald added a reviewer: Szelethus. · View Herald TranscriptMay 28 2021, 6:40 AM

Herald added subscribers: manas, ASDenysPetrov, gamesh411 and 10 others. · View Herald Transcript

martong requested review of this revision.May 28 2021, 6:40 AM

Herald added a project: Restricted Project. · View Herald TranscriptMay 28 2021, 6:40 AM

Herald added a subscriber: cfe-commits. · View Herald Transcript

martong added a parent revision: D103314: [Analyzer][solver] Simplify existing constraints when a new constraint is added.May 28 2021, 6:40 AM

martong mentioned this in D102696: [Analyzer] Find constraints that are directly attached to a BinOp.May 28 2021, 6:43 AM

Looking great, thanks!
I have a couple of notes below.

clang/lib/StaticAnalyzer/Core/SimpleSValBuilder.cpp
1155–1156	Do we actually produce them?
1170–1172	It seems like a very common pattern now: SVal Const = getConst(Expr); if (Const.isUndef()) return Visit(Expr); return Const; At least we can make a function out of this, so we don't repeat this pattern over and over. But I was thinking more radically to actually replace `Visit` with this, what do you think?
clang/test/Analysis/simplify-complex-constraints.cpp
25–27	Since it's not really specific to addition, maybe we can have tests for other operators?

Harbormaster completed remote builds in B106702: Diff 348511.May 28 2021, 7:19 AM

ASDenysPetrov added inline comments.Jun 2 2021, 3:47 AM

clang/lib/StaticAnalyzer/Core/SimpleSValBuilder.cpp
1155–1156	You can met SymbolCast here if `evalCast` produced it. See `SValBuilder::evalCastSubKind(nonloc::SymbolVal...)`: if (!Loc::isLocType(CastTy)) if (!IsUnknownOriginalType \|\| !CastTy->isFloatingType() \|\| T->isFloatingType()) return makeNonLoc(SE, T, CastTy); Also my patch will produce a lot of integral cast symbols D103096. You are welcome to examine it.

vsavchenko added inline comments.Jun 2 2021, 4:03 AM

clang/lib/StaticAnalyzer/Core/SimpleSValBuilder.cpp
1155–1156	In my comment I meant `IntSymExpr`

ASDenysPetrov added inline comments.Jun 2 2021, 4:38 AM

clang/lib/StaticAnalyzer/Core/SimpleSValBuilder.cpp
1155–1156	Oh, anyway :-)

ASDenysPetrov added inline comments.Jun 2 2021, 5:01 AM

clang/lib/StaticAnalyzer/Core/SimpleSValBuilder.cpp
1155–1156	I think we should remove ConcreteInt and and as a consequence IntSymExpr, SymIntExpr in the nearest future. SymbolVal should be reworked to store a QualType, a RangeSet and an expression pointer itself. IMO SymSymExpr and SymbolVal is sufficient to make any calculations. It may help generalize a lot of them. ConcreteInt is just a special case for SymbolVal. More than that PointerToMember can be removed as well, because we would get an info being a pointer to member from QualType in this way.

Returning to the discussion raised in D102696, I'd like to share my vision.
I think we can use much easier approach to use valid constraints at any point of time.
The main idea is lazy-reasoning of the ranges.
This approach:

doesn't need to use canonicalization;
doesn't need to update all the constraints on each setConstraint call (using brute-force or any other structure traversals);
add an additional bindings for every simple operand on each assignment or initialization.
remove an invalid bindings for every simple operand on each assignment or initialization.
recursively get the range when and only when needed (lazy-reasoning).

Example:

a = b + 1;
c = a - 5;
if (b != 42) return 0;
if(c == 38) return 1;
return 0;

a = b + 1;

LHS a lookup. LHS not found.
Add binding $a = $b + 1
Traverse through RHS b + 1.
b lookup. b not found.
Built a new expression a - 1 for b.
Add binding for b: $b = $a - 1
1 is a constant. Skip.

c = a - 5;

LHS c lookup. LHS not found.
Add binding $c = $a - 5
Traverse through RHS a - 5.
a lookup. a found.
If LHS not found and a found, then skip.
5 is a constant. Skip.

if (b != 42) return;

Get the range of $b. Direct range not found. Add $b to the list of visited symbols.
Substitute bindings. From $b != 42 to $a - 1 == 42.
Traverse through $a - 1.
Get the range of $a. Direct range not found. Add $a to the list of visited symbols.
1 is a constant. Skip.
Substitute bindings. From $a - 1 == 42 to $b + 1 - 1 == 42.
Traverse through $b + 1 - 1.
$b is in the list of already visited symbols. No way to find a range.
Create a range from the type [MIN, MAX].
Perform comparison and produce a constraint for b [42, 42].

if(c == 38) return;

Get the range of $c. Direct range not found. Add $c to the list of visited symbols.
Substitute bindings. From $c == 38 to $a - 5 == 38.
Traverse through $a - 5.
Get the range of $a. Direct range not found. Add $a to the list of visited symbols.
5 is a constant. Skip.
Substitute bindings. From $a - 5 == 38 to $b + 1 - 5 == 38.
Traverse through $b + 1 - 5.
Get the range of $b. Direct range found [42, 42].
1 is a constant. Skip.
Calculate(simplificate the expression etc.) a new range for $c [38, 38].
Condition is true.

return 1

P.S. I've just started thinking of the way to solve this problem. For now it's just a sketch of where I'm going. I think this can even replace EquivalenceClass.

In D103317#2793557, @ASDenysPetrov wrote:
Returning to the discussion raised in D102696, I'd like to share my vision.
I think we can use much easier approach to use valid constraints at any point of time.
The main idea is lazy-reasoning of the ranges.
This approach:

doesn't need to use canonicalization;

doesn't need to update all the constraints on each setConstraint call (using brute-force or any other structure traversals);

add an additional bindings for every simple operand on each assignment or initialization.

remove an invalid bindings for every simple operand on each assignment or initialization.

recursively get the range when and only when needed (lazy-reasoning).

Example:
a = b + 1;
c = a - 5;
if (b != 42) return 0;
if(c == 38) return 1;
return 0;
a = b + 1;

LHS a lookup. LHS not found.

Add binding $a = $b + 1

Traverse through RHS b + 1.

b lookup. b not found.

Built a new expression a - 1 for b.

Add binding for b: $b = $a - 1

1 is a constant. Skip.

c = a - 5;

LHS c lookup. LHS not found.

Add binding $c = $a - 5

Traverse through RHS a - 5.

a lookup. a found.

If LHS not found and a found, then skip.

5 is a constant. Skip.

if (b != 42) return;

Get the range of $b. Direct range not found. Add $b to the list of visited symbols.

Substitute bindings. From $b != 42 to $a - 1 == 42.

Traverse through $a - 1.

Get the range of $a. Direct range not found. Add $a to the list of visited symbols.

1 is a constant. Skip.

Substitute bindings. From $a - 1 == 42 to $b + 1 - 1 == 42.

Traverse through $b + 1 - 1.

$b is in the list of already visited symbols. No way to find a range.

Create a range from the type [MIN, MAX].

Perform comparison and produce a constraint for b [42, 42].

if(c == 38) return;

Get the range of $c. Direct range not found. Add $c to the list of visited symbols.

Substitute bindings. From $c == 38 to $a - 5 == 38.

Traverse through $a - 5.

Get the range of $a. Direct range not found. Add $a to the list of visited symbols.

5 is a constant. Skip.

Substitute bindings. From $a - 5 == 38 to $b + 1 - 5 == 38.

Traverse through $b + 1 - 5.

Get the range of $b. Direct range found [42, 42].

1 is a constant. Skip.

Calculate(simplificate the expression etc.) a new range for $c [38, 38].

Condition is true.

return 1

P.S. I've just started thinking of the way to solve this problem. For now it's just a sketch of where I'm going.

Great to know that you are also thinking about this problem!
The more the merrier!

I have a couple of notes about your suggestion.
First is minor, a = b + 1 and c = a + 5 won't actually produce any constraints. Starting from those statements if you request a symbol for a you will get b + 1 and b + 6 for c.
Of course, you can change your code so that we make an assumption that a == b + 1 and c == a + 5 and those would be constraints. That's why I said that it's minor.

Now, for the biggest concern. You cover in a very detailed manner all the traversals and caches for visited symbols, but have Substitute bindings as a sub-step , which is HUGE and absolutely non-trivial.
In your examples, you have a linear chain of related symbolic expressions, i.e. something like this x -> y + 2, y -> z + 3, z -> 4 so that we have x -> 9 after three substitutions. But the problem it is generally not one-to-one relationship, so x -> y1 + 1, x -> y2 + 2, ... , x -> yN + N. Imagine that each yK is involved with another M constraints. So, you would need to check x -> y1 -> z11, if it's not working - check x -> y1 -> z12 and so on. When you run out of options for z1K, you need to backtrack and try x -> y2 -> z21, etc. As you can see, this is BRUTE FORCE.

And it couldn't be easier theoretically. So, instead of trying that many options every time we need to judge about, we take a bit simpler shortcut. It's far from being complete, but honestly we're not pursuing completeness. For my example above, whenever we find out that zKM -> 5, we simplify everything that involves zKM and get yK -> 10 and get x -> 15. It is far from being perfect, I don't like that we need to go through all symbols and try to simplify every single symbol, and it would be good to optimize this scenario (mapping symbols to all symbolic expressions that contain them, for example), but essentially it is cheaper in the long run.

In D103317#2793658, @vsavchenko wrote:

But the problem it is generally not one-to-one relationship, so x -> y1 + 1, x -> y2 + 2, ... , x -> yN + N.

In my approach it can't be more then one binding for one symbol. Like:
x = y + z; produces $x = $y + $z, $y = $x - $z, $z = $x - $y.
Then if z = a;, all above should be invalidated (removed), because we can't rely on those expressions any more. And new ones should be added $z = $a and $a = $z.

As you can see, this is BRUTE FORCE.

I would say this is a linked list with stop-markers, not blindly passing through all the bindings. And I don't talk about implementation details. For now we just need to elaborate an approach.

In D103317#2793781, @ASDenysPetrov wrote:

In D103317#2793658, @vsavchenko wrote:

But the problem it is generally not one-to-one relationship, so x -> y1 + 1, x -> y2 + 2, ... , x -> yN + N.

In my approach it can't be more then one binding for one symbol. Like:
x = y + z; produces $x = $y + $z, $y = $x - $z, $z = $x - $y.

Hmm, Okay, but what about situations if you have: a = a1 + a2 and a = a3 + a4 + a5 are you going to throw away one of these constraints? And if so, how do you want to select which one?

In D103317#2793797, @vsavchenko wrote:

Hmm, Okay, but what about situations if you have: a = a1 + a2 and a = a3 + a4 + a5 are you going to throw away one of these constraints? And if so, how do you want to select which one?

Are you talking about comparison or assignment? Both assignments can't be valid at the same time, and latter replaces bindings of former. In case of comparisons, they both can be valid.
But we should keep in mind that assignment is a write operation which replaces and invalidates previous bindings, and comparison is a read operation. It can add new bindings but can not remove old ones.
This is what I haven't dig deep enough yet. Let's do this together how we can handle that.

In D103317#2793911, @ASDenysPetrov wrote:

In D103317#2793797, @vsavchenko wrote:

Hmm, Okay, but what about situations if you have: a = a1 + a2 and a = a3 + a4 + a5 are you going to throw away one of these constraints? And if so, how do you want to select which one?

Are you talking about comparison or assignment? Both assignments can't be valid at the same time, and latter replaces bindings of former. In case of comparisons, they both can be valid.
But we should keep in mind that assignment is a write operation which replaces and invalidates previous bindings, and comparison is a read operation. It can add new bindings but can not remove old ones.
This is what I haven't dig deep enough yet. Let's do this together how we can handle that.

I replied to you earlier that assignments are not producing constraints. The analyzer has some sort of SSA (not really, but anyways), so every time we reassign the variable it actually becomes a different symbol. So, from this perspective DeclRefExpr x and symbol x are two different things:

int x = foo(); // after this DeclRefExpr and VarDecl x are associated with conjured symbol #1 (aka conj#1)
int c = x + 10; // c is associated with `$conj#1 + 10`
x = a + 1; // DeclRefExpr x is now associated with `$a + 1`
x = c + x; // and now it is `conj#1 + 11`

There is no symbolic assignment in the static analyzer. Symbols are values, values don't change. We can only obtain more information of what particular symbolic value represents.

I hope it makes it more clear.

As for equality, we already track equivalence classes of symbols and they can have more than 2 symbols in them. Actually, this whole data structure mostly makes sense for cases when the class has more than 2 elements in it.

! In D103317#2793797, @vsavchenko wrote:

I replied to you earlier that assignments are not producing constraints. The analyzer has some sort of SSA (not really, but anyways), so every time we reassign the variable it actually becomes a different symbol. So, from this perspective DeclRefExpr x and symbol x are two different things:
int x = foo(); // after this DeclRefExpr and VarDecl x are associated with conjured symbol #1 (aka conj#1)
int c = x + 10; // c is associated with `$conj#1 + 10`
x = a + 1; // DeclRefExpr x is now associated with `$a + 1`
x = c + x; // and now it is `conj#1 + 11`
There is no symbolic assignment in the static analyzer. Symbols are values, values don't change. We can only obtain more information of what particular symbolic value represents.

I hope it makes it more clear.

As for equality, we already track equivalence classes of symbols and they can have more than 2 symbols in them. Actually, this whole data structure mostly makes sense for cases when the class has more than 2 elements in it.

Thanks for the clarifications.

I'm trying to say that we need to keep our attention on assignments/initializations as well if we want to solve this problem efficiently. I didn't say that assignments produce constraints. Of course they do not :-). I want we produce more bindings(associations) on assignments to use them as mapping for further substitutions and traversing to get the right range.

As for the equivalence classes. I think we should go away from special cases. I mean what about lower-then classes or greater-or-equal classes. I think you got what I mean. I'm trying to find a general approach which can cover almost all individual improvements around CSA codebase(about ranges).

I started working on this when saw your discussion. Just what you also consider the direction I'm moving. A lot of corner-examples, like you gave me, are welcome. Thank you for the feedbacks!

In D103317#2794099, @ASDenysPetrov wrote:
! In D103317#2793797, @vsavchenko wrote:
I replied to you earlier that assignments are not producing constraints. The analyzer has some sort of SSA (not really, but anyways), so every time we reassign the variable it actually becomes a different symbol. So, from this perspective DeclRefExpr x and symbol x are two different things:
int x = foo(); // after this DeclRefExpr and VarDecl x are associated with conjured symbol #1 (aka conj#1)
int c = x + 10; // c is associated with `$conj#1 + 10`
x = a + 1; // DeclRefExpr x is now associated with `$a + 1`
x = c + x; // and now it is `conj#1 + 11`
There is no symbolic assignment in the static analyzer. Symbols are values, values don't change. We can only obtain more information of what particular symbolic value represents.

I hope it makes it more clear.

As for equality, we already track equivalence classes of symbols and they can have more than 2 symbols in them. Actually, this whole data structure mostly makes sense for cases when the class has more than 2 elements in it.
Thanks for the clarifications.

I'm trying to say that we need to keep our attention on assignments/initializations as well if we want to solve this problem efficiently. I didn't say that assignments produce constraints. Of course they do not :-). I want we produce more bindings(associations) on assignments to use them as mapping for further substitutions and traversing to get the right range.

As for the equivalence classes. I think we should go away from special cases. I mean what about lower-then classes or greater-or-equal classes. I think you got what I mean. I'm trying to find a general approach which can cover almost all individual improvements around CSA codebase(about ranges).

I started working on this when saw your discussion. Just what you also consider the direction I'm moving. A lot of corner-examples, like you gave me, are welcome. Thank you for the feedbacks!

Thank you for your comments Denys! It's great to see that some else is also bothered by these problems, and the more approaches we have the deeper the overall understanding is.

In D103317#2794099, @ASDenysPetrov wrote:
! In D103317#2793797, @vsavchenko wrote:
I replied to you earlier that assignments are not producing constraints. The analyzer has some sort of SSA (not really, but anyways), so every time we reassign the variable it actually becomes a different symbol. So, from this perspective DeclRefExpr x and symbol x are two different things:
int x = foo(); // after this DeclRefExpr and VarDecl x are associated with conjured symbol #1 (aka conj#1)
int c = x + 10; // c is associated with `$conj#1 + 10`
x = a + 1; // DeclRefExpr x is now associated with `$a + 1`
x = c + x; // and now it is `conj#1 + 11`
There is no symbolic assignment in the static analyzer. Symbols are values, values don't change. We can only obtain more information of what particular symbolic value represents.

I hope it makes it more clear.

As for equality, we already track equivalence classes of symbols and they can have more than 2 symbols in them. Actually, this whole data structure mostly makes sense for cases when the class has more than 2 elements in it.
Thanks for the clarifications.

I'm trying to say that we need to keep our attention on assignments/initializations as well if we want to solve this problem efficiently. I didn't say that assignments produce constraints. Of course they do not :-). I want we produce more bindings(associations) on assignments to use them as mapping for further substitutions and traversing to get the right range.

As for the equivalence classes. I think we should go away from special cases. I mean what about lower-then classes or greater-or-equal classes. I think you got what I mean. I'm trying to find a general approach which can cover almost all individual improvements around CSA codebase(about ranges).

I started working on this when saw your discussion. Just what you also consider the direction I'm moving. A lot of corner-examples, like you gave me, are welcome. Thank you for the feedbacks!

Assignments and initializations don't affect anything, we shouldn't care about them at all because they don't affect symbolic values in any way. You keep bringing up assignments/initializations when I repeatedly tell you that these are orthogonal to symbolic values, they simply associate certain symbolic values/expressions with memory regions (aka bind them).

I want to make sure you understand that there is no silver bullet here. There is no one magic method that will resolve everything in polynomial let alone linear time (unless someone proves that NP = P). This problem is inherently exponential. We try to avoid it at all costs (and, thus, suck sometimes).
In general, when SMT-solvers need to solve a set of bit-vector equations, they use a technique called bit blasting - just turning each n-bit integer into n-separate boolean variables, and using SAT solver after that. More advanced solvers use special theories before to reduce dimensionality of the problem, one of them is equality theory. These theories are not limitations or "special cases", they are pieces of fundamental knowledge about the domain. Those are as generic as you can go. One symbolic expression can be equal to multiple symbolic expressions at the same time, it is the fact, not a special case.

I'm sorry for being too harsh here, but this problem IS harder than you think. If you want to approach it correctly, and not waste your time, you should get these fundamentals straight.

@vsavchenko

I appologize for my immaturity and for the audacity of my infantile assumptions. I admit any constructive criticism. Thank you for referencing the theory. I will thoroughly study the domain in all.

As for the assignments and why I brought them up, that's because I just tried to look at the problem from another angle. Consider two snippets:

void f(int a, int b, int c, int x, int y){
  a = b;
  x = y;
  c = a;
  // What can you say about a, x, c?
  // `a` equals to `b` and 'c'
  // `x` equals to `y`
  // `c` equals to `a` and `b`
}

and

void f(int a, int b, int c, int x, int y){
  if(a == b) {
    if(x == y) {
      if(c == a) {
        // What can you say about a, x, c?
        // `a` equals to `b` and 'c'
        // `x` equals to `y`
        // `c` equals to `a` and `b`
      }
    }
  }
}

They both generates the same relationship between the variables. Assignment seemed to me as just a special case in a true branch of == operator and can be substituted with it in calculations. So, I decided to make some assumptions in this way. Digging more, it opened a lot of gaps and lacks for me.
Next time I will investigate deeper to present a finalized solution and not wasting your time explaining me the basics.

In D103317#2803238, @ASDenysPetrov wrote:

@vsavchenko

I appologize for my immaturity and for the audacity of my infantile assumptions. I admit any constructive criticism. Thank you for referencing the theory. I will thoroughly study the domain in all.

No need to apologize! I do like your enthusiasm, great things come from it. But as I said, I don't want you to waste your own time pursuing things that won't work.

As for the assignments and why I brought them up, that's because I just tried to look at the problem from another angle. Consider two snippets:

OK, let me annotate the code, so we can hopefully close this topic once and for all.

void f(int a, int b, int c, int x, int y){

At this point we have 5 memory regions a, b, c, x and y. They represent the memory cells allocated for the corresponding parameters on stack. When we call function, these actually have some values, but we don't know them due to the limitations of the analysis. For this reason, we say that they contain symbolic values. So, let's simply enumerate that so that there is no misunderstanding. Values: #1, #2, #3, #4, and #5. These are not real numbers, we enumerate some unknown values. As I said, we introduced them as initial values for parameters, so our store (mapping from memory to values) looks likes this: a -> #1, b -> #2, c -> #3, x -> #4, y -> #5. We don't know anything about these values at this point.

a = b;

What do we know at this point? Do we get more information about the value that was stored in a before? No, we overwrote this value entirely, and what changed here is the mapping: a -> #2, b -> #2, c -> #3, x -> #4, y -> #5 because b contained value #2. Symbolic value #1 is gone, it's dead. No-one will ever see or use it again.

x = y;

Using the same logic: a -> #2, b -> #2, c -> #3, x -> #5, y -> #5.

c = a;

And again: a -> #2, b -> #2, c -> #2, x -> #5, y -> #5.

  // What can you say about a, x, c?
  // `a` equals to `b` and 'c'
  // `x` equals to `y`
  // `c` equals to `a` and `b`
}

Yes, they are, but we don't need constraints for that. We know which values are associated with each memory region and we can compare them. That's all, this simple. That's why assignments and initializations DON'T MATTER for the solver, these are simply STORE of some symbolic value into memory. Solver cares only about symbolic values, memory is irrelevant.

and

void f(int a, int b, int c, int x, int y){

As before: a -> #1, b -> #2, c -> #3, x -> #4, y -> #5.

if(a == b) {

Now, the value that we imagine to be in memory region a (namely #1) is NOT GONE! It is still there, it is being used. And what we know after this if, is that on one of the paths after it #1 == #2 and on the other path #1 != #2. These are just values that we know nothing about, but we started to gain some knowledge.
So, the store is the same (the values stored in those memory cell are the same as they were before): a -> #1, b -> #2, c -> #3, x -> #4, y -> #5, but now we add constraint #1 == #2.

if(x == y) {

The same here, store is the same, but constraints are now: #1 == #2 and #4 == #5.

if(c == a) {

#1 == #2 == #3 and #4 == #5

        // What can you say about a, x, c?
        // `a` equals to `b` and 'c'
        // `x` equals to `y`
        // `c` equals to `a` and `b`
      }
    }
  }
}

We can actually figure out that #2 == #3 because of equality theory. We only knew #1 == #2 and that #3 == #1, but equality is a transitive binary relation and we reflect that notion in our equivalence classes data-structure.

They both generates the same relationship between the variables. Assignment seemed to me as just a special case in a true branch of == operator and can be substituted with it in calculations. So, I decided to make some assumptions in this way. Digging more, it opened a lot of gaps and lacks for me.

Relationship between variables are the same, but neither solver nor the analyzer reason about things in terms of variables. It's memory and values that the memory carries. In the first example, we have only 2 values stored in 5 memory locations. In the second example, it's 5 values stored in 5 memory locations. On some of the paths we happen to know how those values are related (which is not true for other paths), that's constraints.

I hope now it's clear.

Next time I will investigate deeper to present a finalized solution and not wasting your time explaining me the basics.

We probably should not have discussions like these because they are not directly about the patch on review. If you want to continue this discussion, let's bring it to [cfe-dev].

martong mentioned this in D112296: [Analyzer][solver] Handle adjustments in constraint assignor remainder.Oct 25 2021, 2:44 PM

Rebase on top of parent revisions
Address review comments

clang/lib/StaticAnalyzer/Core/SimpleSValBuilder.cpp
1155–1156	Do we actually produce them? Yes we produce `IntSymExpr`s.
1170–1172	Good idea!
clang/test/Analysis/simplify-complex-constraints.cpp
25–27	Ok, I've added a new test, and modified a bit this test.

Harbormaster completed remote builds in B133918: Diff 386796.Nov 12 2021, 4:05 AM

martong added a parent revision: D113753: [Analyzer][Core] Better simplification in SimpleSValBuilder::evalBinOpNN.Nov 12 2021, 4:06 AM

martong added a parent revision: D113754: [Analyzer][Core] Simplify IntSym in SValBuilder.

martong removed a parent revision: D103314: [Analyzer][solver] Simplify existing constraints when a new constraint is added.Nov 12 2021, 4:09 AM

martong retitled this revision from [Analyzer][engine][solver] Simplify complex constraints to [Analyzer][Core] Make SValBuilder to better simplify svals with 3 symbols in the tree.Nov 12 2021, 4:33 AM

martong edited the summary of this revision. (Show Details)

Update in comments: Unknown -> Undef

Harbormaster completed remote builds in B133928: Diff 386806.Nov 12 2021, 5:22 AM

To me at least, the patch looks good.
Please post some comparative measurements to demonstrate it won't introduce runtime regression.

clang/lib/StaticAnalyzer/Core/SimpleSValBuilder.cpp
1149	Let's be explicit about it.
1153–1156
clang/test/Analysis/svalbuilder-simplify-compound-svals.cpp
23 ↗	(On Diff #386806)	It feels odd that in some of your examples you return some value, but in the rest, you don't.
32 ↗	(On Diff #386806)	You could additionally assert that `y == 0` and `z == 0`.
50 ↗	(On Diff #386806)	What if `z` were in the middle? Would it still pass?
65 ↗	(On Diff #386806)	Would the test pass if you were using `z + y != 1` here?

Return explicitly with UndefinedVal
Unify test cases (return 0 -> return)

martong added inline comments.Nov 23 2021, 7:26 AM

clang/lib/StaticAnalyzer/Core/SimpleSValBuilder.cpp
1149	Good point.
1153–1156	Makes sense.
clang/test/Analysis/svalbuilder-simplify-compound-svals.cpp
23 ↗	(On Diff #386806)	yep, I agree, updated.
32 ↗	(On Diff #386806)	`y * z == 0` does not imply that both `y` and `z` are `0`. However, we could expect that one of them is `0`, but currently we don't have such a deduction (this might be done in `ConstraintAssignor` in an independent patch).
50 ↗	(On Diff #386806)	No, that would not pass. This is because we do not handle commutativity, for that we should have a canonical form of the symbol trees (check out point 3 here https://reviews.llvm.org/D102696#2784624).
65 ↗	(On Diff #386806)	No, because we do not handle commutativity.

In D103317#3127318, @steakhal wrote:

To me at least, the patch looks good.
Please post some comparative measurements to demonstrate it won't introduce runtime regression.

Sure!

stats.html184 KBDownload

Teaser:

martong added inline comments.Nov 23 2021, 7:39 AM

clang/test/Analysis/svalbuilder-simplify-compound-svals.cpp
32 ↗	(On Diff #386806)	this might be done in ConstraintAssignor in an independent patch Actually, it is not a good idea to bifurcate there (and architecturally impossible ATM), so we should implement that differently.

In D103317#3148868, @martong wrote:

In D103317#3127318, @steakhal wrote:

To me at least, the patch looks good.
Please post some comparative measurements to demonstrate it won't introduce runtime regression.

Sure!

stats.html184 KBDownload

Teaser:

Please repeat the measurement for openssl. There must have been some interference in the memory consumption. Aside from that the results look great.

clang/lib/StaticAnalyzer/Core/SimpleSValBuilder.cpp
1110	Where did you address this FIXME?

martong marked an inline comment as done.Nov 23 2021, 8:25 AM

martong added inline comments.

clang/lib/StaticAnalyzer/Core/SimpleSValBuilder.cpp
1110	I didn't, but this `FIXME` became obsolete at some point in the past. The reasons: We do support `SymExprs`. In `simplifySVal` we have a full blown SymExpr visitor implemented that checks for constant values in the symbol subtrees. And at L1103 we query the top symbol of tree as well. The git history shows that the FIXME was previously this: `// FIXME: Add support for SymExprs in RangeConstraintManager.` But we do support SymExprs already in RangeConstraintManager ...

steakhal added inline comments.Nov 23 2021, 8:40 AM

clang/lib/StaticAnalyzer/Core/SimpleSValBuilder.cpp
1110	Thanks.

Harbormaster completed remote builds in B135633: Diff 389198.Nov 23 2021, 12:19 PM

Add new test case (for no-crash)

Remove superfluous param (int a) from the new test case

I've added a new test case that triggers an assertion. That problem is getting fixed in a new parent patch, which I am about to add to the stack very soon.

Harbormaster completed remote builds in B136157: Diff 389936.Nov 26 2021, 1:45 AM

martong added a parent revision: D114619: [Analyzer][solver] Do not remove the simplified symbol from the eq class.Nov 26 2021, 2:14 AM

Please repeat the measurement for openssl. There must have been some interference in the memory consumption. Aside from that the results look great.

I did. Average memory usage is unaffected.

stats.html46 KBDownload

In D103317#3155191, @martong wrote:

Please repeat the measurement for openssl. There must have been some interference in the memory consumption. Aside from that the results look great.

I did. Average memory usage is unaffected.

stats.html46 KBDownload

It's good enough for me, thanks.

clang/test/Analysis/svalbuilder-simplify-compound-svals.cpp
75 ↗	(On Diff #389936)	I guess, this statement should not be reachable. Please demonstrate it by using the `clang_analyzer_WarnIfReached()`.

martong marked an inline comment as done.Nov 29 2021, 6:57 AM

martong added inline comments.

clang/test/Analysis/svalbuilder-simplify-compound-svals.cpp
75 ↗	(On Diff #389936)	Actually, L72 simplifies to `1 != b`, L73 simplifies to the same, L74 is `b ^ 2 == 1`. And this system has a solution: `b == -1` satisfies both L72 and L74 conditions. So, this line is reachable, correctly and I've updated the test case like so. Note that, however, the ConstraintAssignor is not smart enough to deduce that `b == -1`. I have a WIP branch to accomplish that here.