This is an archive of the discontinued LLVM Phabricator instance.

[analyzer] Reasoning about comparison expressions in RangeConstraintManager
ClosedPublic

Authored by ASDenysPetrov on Apr 27 2020, 8:00 AM.

Details

Summary

I got an idea how to make RangeConstraintManager​ more sofisticated within comparison operations.
I want you speak out, share your vision about this idea. Briefly speaking, RangeConstraintManager shall care about the sanity of comparison expressions sequences and help reasonably to branch an exploded graph.

The idea came to me as a thing which resolves this bug PR13426.
Let's consider the next snippet:

int foo(int x, int y) {
    int z;
    if (x == y) {
        z = 0;
    }
    if (x != y) {
        z = 1;
    }
    return z;
}

Obviously that z will be initialized, but CSA reports next:

warning: Undefined or garbage value returned to caller
      [core.uninitialized.UndefReturn]
        return z;

It happenes because CSA does not take into account that x == y and x != y are just two opposites, as if it was:

if (x == y) {
    z = 0;
} else {
    z = 1;
}

So my improvements are in handling case above and similar ones. Assume for some comparison x < y we look for any other comparisons with the same operands (x > y, y != x, etc.). Then we do logical calculations and refuse impossible branches.
For example:

  • x < y and x > y at the same time are impossible.
  • x >= y and x != y at the same time makes x > y true only.
  • x == y and y == x are just reversed but the same.

It covers all possible combinations and is described in the next table.

<><=>===!=UnknownX2
<10*00*1
>010*0*1
<=101*1*0
>=01*11*0
==00**101
!=11**010

The table holds states for the preceding comparison operation assuming the latter is true.
For instance, assume x < y is true that means x != y is surely true:

if (x preciding_operation y) // <    | !=      | >
    if (x operation y)       // !=   | >       | <
         tristate            // True | Unknown | False

Columns stands for a preceding operator. Rows stands for a current operator. Each row has exactly two Unknown cases. UnknownX2 means that both Unknown preceding operators are met in code, and there is a special column for that, for example:

if (x >= y)
  if (x != y)
    if (x <= y)
      False only

As a result of processing an example below, we have

and exploded graphs.

void foo(int y, int x) {
  if (x < y) {
    if (y > x);
    if (x > y);
    if (x <= y);
    if (y <= x);
    if (x == y);
    if (y != x);
  }
}

Please, express your thoughts down here, if it is worth to be.

Diff Detail

Event Timeline

ASDenysPetrov created this revision.Apr 27 2020, 8:00 AM
ASDenysPetrov edited the summary of this revision. (Show Details)Apr 27 2020, 8:12 AM
ASDenysPetrov edited the summary of this revision. (Show Details)Apr 27 2020, 8:26 AM
ASDenysPetrov retitled this revision from RangeConstraintManager optimizations in comparison expressions to [analyzer] RangeConstraintManager optimizations in comparison expressions.
ASDenysPetrov edited the summary of this revision. (Show Details)Apr 27 2020, 8:32 AM
ASDenysPetrov edited the summary of this revision. (Show Details)May 1 2020, 2:47 PM
ASDenysPetrov added a reviewer: vsavchenko.
ASDenysPetrov edited the summary of this revision. (Show Details)May 1 2020, 2:59 PM
ASDenysPetrov edited the summary of this revision. (Show Details)
ASDenysPetrov edited the summary of this revision. (Show Details)
ASDenysPetrov marked an inline comment as done.May 6 2020, 1:53 AM
ASDenysPetrov added inline comments.
clang/lib/StaticAnalyzer/Core/RangeConstraintManager.cpp
1182–1183

Folk, is this a good idea to explicitly create bool ranges as a return value?
As for me, comparisons like >, <, etc. can only produce bool-based ranges, otherwise it would be weird.

ASDenysPetrov added a comment.EditedMay 6 2020, 2:15 AM

Guys,
@xazax.hun, @Charusso, @steakhal, @vsavchenko, @NoQ, @baloghadamsoftware,
please, tell something. What do you think of this improvement?

Guys,
@xazax.hun, @Charusso, @steakhal, @vsavchenko, @NoQ, @baloghadamsoftware,
please, tell something. What do you think of this improvement?

Sorry, I will definitely take a look at this later today! In the meantime, we are always curious about the performance and usability implications of those patches.
Did you do some measurements on some projects how the run-time changes?
Also, it would be useful to see how statistics like coverage patterns, number of exploded nodes etc changes. See the output of -analyzer-stats option for details.

clang/test/Analysis/constraint_manager_conditions.cpp
214

Nit: please add new line at the end of the file.

xazax.hun added inline comments.May 6 2020, 5:02 AM
clang/lib/StaticAnalyzer/Core/RangeConstraintManager.cpp
1141

I have really hard time processing how to interpret AnyX2.

For example in the code below:

if (x >= y)
  if (x != y)
    if (x <= y)
      return false
if (x >= y)
  if (x == y)
    if (x <= y)
      return true

We would get different results for <=. So I do not really get how I should read the AnyX2 column.

1175

Could LHS and RHS be other expressions? Does it make sense to continue executing this function if one of them is not a simple symbol?

1182–1183

I think modeling booleans with ranges is OK. This is what the analyzer is doing at the moment. But I guess the question is about whether boolean results of comparisons is a good way to store the relationships between the symbols. I am not convinced about that see my inline comment below.

1188

I am not really familiar with how constraints are represented but I vaguely recall the analyzer sometimes normalizing some expressions like conversion A == B to A - B == 0. I am just wondering if this API to look this expression up is not the right abstraction as it might be better to handle such normalizations in a unified, central way.

Also, note that this method does not handle transitivity. I wonder if maintaining set of ranges is the right representation for this information at all. The ordering between the symbols will define a lattice. Representing that lattice directly instead of using ranges might be more efficient.

Thanks for working on this, I do believe the analyzer would greatly profit from better constraint solving capabilities. Unfortunately, we had some troubles in the past trying to improve upon the current status and we had to revert multiple patches. This is why the community is super cautious when it comes to changes like this.

A high-level note:

Currently the constraint solver we have is interval based.
I think this patch makes a step towards a more refined domain: octagons [1].

At this point I wonder, if we should study octagons first, and see if there are some efficient implementation strategies are available.
I do not insist on rewriting the whole constraint manager to be octagon-based, but it would be useful to see what we actually pursuing here and what are the implications.

[1]: https://arxiv.org/pdf/cs/0703084.pdf

ASDenysPetrov marked 5 inline comments as done.EditedMay 6 2020, 7:28 AM

@xazax.hun, many thanks for looking at the patch.

Did you do some measurements on some projects how the run-time changes?

I have never run them before. How I can do this? What project to use as a sample?

[1]: https://arxiv.org/pdf/cs/0703084.pdf

I will definitely read it, thank you.

Also, it would be useful to see how statistics like coverage patterns, number of exploded nodes etc changes.

You can check

and exploded graphs, to see the number of nodes.

See the output of -analyzer-stats option for details.

I didn't find this option in clang --help. I've never use it. Please, explain how to use it.

clang/lib/StaticAnalyzer/Core/RangeConstraintManager.cpp
1141

I'm sorry for the obscure explanation. I was really thinking about how to explain it more clear.
Let's consider this code:

if (x >= y)
  if (x != y)
    if (x <= y)

If 'x >= y' is true, then following x <= y is Any ( can be true or false ). See table in the summary.
If 'x != y' is true, then following x <= y is Any ( can be true or false ).
If 'x >= y && x != y' is true, then following x <= y is False only (it means that we previously met two Any states, what I've called as AnyX2).

1175

I think it makes sense, because (x+3*y) > z and z < (x+3*y) perfectly fits in the logic.

1182–1183

Once I met a case when logic operator < retruns me int-based ranges {0,0} on the false branch and {1, 429496725} (instead of bool-based {1,1}) on the true branch. It confused me, so the question arised.

1188

conversion A == B to A - B == 0.

All my samples x == y worked as x == y. I havn't seen such transformations.

Representing that lattice directly instead of using ranges might be more efficient.

How do you imagine that lattice? Could you detail it?

I have never run them before. How I can do this? What project to use as a sample?

Unfortunately, we do not really have a well-defined set of benchmarks. But as a first step, you could run the analyzer on LLVM before and after applying your patch and measure the run-time. Or you could start with a smaller project (you can pick any well-known project from GitHub or other similar sites) as checking LLVM can be really time-consuming (especially in debug mode). To run the analyzer on a project I recommend using scan-build or CodeChecker. If you feel like, you can also experiment with the CSA-testbench (https://github.com/Xazax-hun/csa-testbench), but it requires quite a lot of effort to set up upfront.

Also, it would be useful to see how statistics like coverage patterns, number of exploded nodes etc changes.

You can check

and exploded graphs, to see the number of nodes.

This is only for a small artificial example. We would like to learn more about how the node count changes for a large real-world project like LLVM. We do not need to see the actual exploded graph for those big projects as they are really hard to interpret, we just want to know the numbers.

See the output of -analyzer-stats option for details.

I didn't find this option in clang --help. I've never use it. Please, explain how to use it.

I forgot to mention, not all of the analyzer flags have a corresponding flag in the driver. You can check all the analyzer related flags in the frontend (as opposed to the driver) using the following command: clang -cc1 -help | grep "analyzer"
You can either add -analyzer-stats to your cc1 invocation or you can use the -Xclang to pass the flag through the driver this way: -Xclang -analyzer-stats. This will modify the output of the analyzer to include additional statistics about the analysis.

clang/lib/StaticAnalyzer/Core/RangeConstraintManager.cpp
1141

Could you tell me how the reasoning would be different for

if (x >= y)
  if (x == y)
    if (x <= y)

?

1175

Makes sense, thanks. Maybe it would be worth to rephrase the comment to note that x and y can also stand for subexpressions, not only for actual symbols.

1182–1183

I see. I do not see any problem with that immediately, but @NoQ is much more well-versed to answer this question.

1188

A lattice is basically a DAG, where the nodes are the values (symbols, or subexpressions), and the edges are ordering. So it defines an ordering. E.g., A is smaller than B, if and only if there is a path from A to B. I am not sure whether this is the right representation, but it would take care of transitivity.

xazax.hun added inline comments.May 6 2020, 7:53 AM
clang/lib/StaticAnalyzer/Core/RangeConstraintManager.cpp
1141

Oh never mind, after x == y the x <= y would be true. But still, I find this logic hard to explain, understand, and prove correct.
I'd like to see at least some informal reasoning why will it always work. Also, if someone would like to extend this table in the future (e.g. with operator <=>), how to make it in such a way that does not break this logic.

ASDenysPetrov marked 3 inline comments as done.May 6 2020, 9:45 AM

You can either add -analyzer-stats to your cc1 invocation

It helps, thanks.

clang/lib/StaticAnalyzer/Core/RangeConstraintManager.cpp
1141

When I was creating a table it was just 6x6 without AnyX2 column. It should tell of what RangeSet we shall return (true, false or empty). It covers cases with only one preceding condition.
Then I found that two certain sequential conditions can also tell us what happen to the next condition. And those two conditions happened to be described as Any in the table. Thus I deduced one more column for this purpose.
In other words:

if (x >= y)     // 3. Aha, we found `x >= y`, which is also `Any` for `x <= y` in the table
  if (x != y)   // 2. Aha, we found `x == y`, which is `Any` for `x <= y` in the table
    if (x <= y) // 1. Reasoning a RangeSet about this `x <= y`
    // 4. OK, we found two `Any` states. Let's look at the value in `AnyX2` column for `x <= y`.
    // 5. The value is `False`, so return `FalseRangeSet`, assuming CSA builds false branch only.

You can substitute any other row from the table and corresponding Any-comparisons.
Honestly, I don’t know how else to explain this. :)

I find this logic hard to explain, understand, and prove correct.

If you come up with it, It becomes really easy and clear. You'll see. But I agree that AnyX2 can be changed to something more obvious.

1175

Yes! Great!

1188

I will inspect this idea. Thanks.

xazax.hun added inline comments.May 6 2020, 12:01 PM
clang/lib/StaticAnalyzer/Core/RangeConstraintManager.cpp
1141

Ok, in the meantime I revisited the algorithm and I think now I understand why does it work.
I think the main observation I missed was the following: Every row has exactly two Any entries, so we only need to think about one combination per row. I think originally I interpreted your matrix wrong (the transposed one) and it did not really add up.
So I think you should document the fact that each row has exactly two unknown cases. But for each of the rows, if we have both facts, we can actually evaluate the current operator.

The goal is that if someone wants to extend this code later on (e.g. if the standard committee introduces a new operator that we handle), the next person should be able to modify this algorithm without breaking the existing code. So as you can see, adding a new column/row could break this invariant of having exactly 2 unknowns in each row rendering this code incorrect. This is why it is important, to explain what makes the algorithm work.

I might miss something, but I do not really see test cases that are covering these.

I think some renaming might also help others understanding what is going on:

  1. Capitalize variable names, for example, tristate.
  2. Chose a bit more descriptive names. tristate from the previous example describes the type of the variable. But variable names should describe their contents rather than their type. It can be Result in case it describes the result of the operation. I think in 3-valued logic we tend to use Unknown rather than Any. You have code like auto tristate = CmpOpTable[IndexOP][Index];. This is not really readable as it is really hard to tell what index corresponds to what. One of the indexes correspond to the evaluated operator while the other corresponds to the query. The variable names should reflect that.
ASDenysPetrov marked an inline comment as done.May 7 2020, 3:35 AM
ASDenysPetrov added inline comments.
clang/lib/StaticAnalyzer/Core/RangeConstraintManager.cpp
1141

Thank you. I'll take into account all your notes. I'll update the patch soon.

ASDenysPetrov edited the summary of this revision. (Show Details)
ASDenysPetrov marked an inline comment as done.May 7 2020, 8:41 AM

Updated the patch due to comments of @xazax.hun

clang/test/Analysis/constraint_manager_conditions.cpp
214

I've added a new line and checked twice before creating a patch. But it didn't appear.

xazax.hun marked an inline comment as done.May 8 2020, 4:23 AM

Thanks!

I still have some nits inline, but overall the implementation looks good to me.
I think, however, that we should not only focus on the implementation but on the high-level questions.
Is this the way forward we want? Can we do it more efficiently? What is the performance penalty of this solution?

The measurements I mentioned earlier about runtimes, node count and so on could help answer some of those questions.

clang/lib/StaticAnalyzer/Core/RangeConstraintManager.cpp
1140

I think it would be better to use slightly different terminology. Instead of preceding operator I would say something like "previously known fact".

1150

I think this table together with the operations that transform between indices and operators could be factored out into a separate class. That would make this method slightly cleaner.

1166

In LLVM we usually only use auto for iterators and to avoid repeating the type twice. Auto is ok above with the dynamic_cast, but here and some other places the type should be spelled out.

1221

I think IndexPrevOP is confusing as the "previous" operation is not necessarily the previous operation we evaluated. In fact IndexPrevOP could have been evaluated any time in the past. We query the known facts that are known for the current state, I think this is what the names should reflect. I suggest to use CurrentOP and QueriedOP.

clang/test/Analysis/constraint_manager_conditions.cpp
214

This looks good now. If it misses the new line there is a message saying that explicitly.

ASDenysPetrov marked 3 inline comments as done.May 12 2020, 8:55 AM

The measurements I mentioned earlier about runtimes, node count and so on could help answer some of those questions.

I'm looking for an appropriate project to measure performance, since I have some troubles running the test-suite on Windows.

clang/lib/StaticAnalyzer/Core/RangeConstraintManager.cpp
1150

I'll try to come up how to turn it into the class.

1166

Thanks for the guidance, I'll check the code for implicit types.

1221

No problem. I'll rename them.

Updated due to @xazax.hun comments.

xazax.hun added inline comments.May 13 2020, 7:08 AM
clang/lib/StaticAnalyzer/Core/RangeConstraintManager.cpp
91

I think this function should not work with indices. It should work with operators. So the caller of this function never need to invoke IndexFromOp or OpFromIndex making this function easier to use.

Updated. @xazax.hun you were right. Now it looks more readable. Thanks.

ASDenysPetrov added a comment.EditedMay 20 2020, 12:39 AM

I've made some performance measurements on https://github.com/vim/vim project.
Since I'm using Windows I had no chance to do a lot of test on different projects, because of Unix-environment dependency.
I've tested the most weighty files running clang --analyze -Xclang "-Iproto" -Xclang -analyzer-stats src/file.c on each.
In this way I've tested 84 files. And here is a result table (Each row contains a result of testing an individual c/cpp-file).


Total results are:
test 1

BeforeAfterDelta
770,9s760,8s1,31%

test 2

BeforeAfterDelta
791,4s771,8s2,48%

Updated. Rebased with D79232.
@vsavchenko, @NoQ, @xazax.hun please look.

xazax.hun added a comment.EditedMay 29 2020, 5:42 AM

@xazax.hun, any thoughts?

I think we should check it on some more projects. We saw vastly different analyzer behavior on other projects in the past. I think you should be able to run it on any cmake project that supports windows by generating compilation database and using scan-build-py (in case other methods did not work).

Overall, the performance penalty does not look that bad. But I wonder how would this compare to a solution that also handles transitivity. We did not really look into the alternatives yet.

I'd like to know what @vsavchenko thinks as Valeriy is more into this kind of stuff lately.

Sorry for postponing this review.
I'll get to it right away!

ASDenysPetrov added a comment.EditedMay 29 2020, 6:20 AM

@xazax.hun

I think we should check it on some more projects. We saw vastly different analyzer behavior on other projects in the past.

I completely agree with you. But, unfortunately, vim-proj is the only I could squeeze from that bunch.

I think you should be able to run it on any cmake project that supports windows by generating compilation database and using scan-build-py (in case other methods did not work).

I'll appreciate if you'd give some of such projects. Another point is that I don't know how to get printed stats from the scan-build.

But I wonder how would this compare to a solution that also handles transitivity. We did not really look into the alternatives yet.

I've read an article about Octagons you gave me. The article covers much bigger and more complicated scope, then mine solution. For now I'm not really ready to enlarge it, as it needs a deep investigation. I think for now it can be accepted as an intermadiate proposal.

vsavchenko added inline comments.May 29 2020, 7:40 AM
clang/lib/StaticAnalyzer/Core/RangeConstraintManager.cpp
32

+1 for this static assert!
It's good to ensure such assumptions, even if those are very unlikely to change

80

I would prefer function names to comply with LLVM coding standards (start with a verb and a lowercase letter
https://llvm.org/docs/CodingStandards.html#name-types-functions-variables-and-enumerators-properly

665

I believe that when we use auto we still try to be more verbose with it, so in this case it should be smth like const auto *SSE

689

Sorry for nitpicking, but it seems like the grammar is a bit off in the the all columns except the last part

690

I think that treat is not the best option here. Maybe smith like process will do?

696

Maybe we can name this constant somehow differently to be more verbose?

700–705

Please, correct me if I'm wrong, but I thought that we are iterating over all comparison operators (excluding <=>), which means that if we don't find it on this iteration for let's say x < y then we'll find it (or already did) for x > y. So, my question is - why do we have this clause then?

And it confuses me that RangeSet now corresponds to a reversed operator, while QueriedOP remains the same.

Maybe a good commentary explaining why we need it could help!

ASDenysPetrov marked 6 inline comments as done.May 29 2020, 8:51 AM

Do not apologize for syntax, language or code style remarks. Them make code better and it's really important for me to know about my faults.

clang/lib/StaticAnalyzer/Core/RangeConstraintManager.cpp
80

I'll add a get prefix. Saying about lower/uppercases, I'd say this is a mess inside this file. I've just decided to use an upper one.

665

I'll do.

689

Do you mean to change to all of the columns exept the last one ('UnknownX2')?

690

I met a lot of treat usage in terms of handle/process among the code, so just followed it.

696

What do you think about FoundRangeSet?

700–705

No-no. Please, look carefully.
At first we try to find X op Y, when op is a comparison operator. If we failed then try to find the reversed (flipped) version of the expression Y reversed(op) X.
Examples: x > y and y < x, x != y and y != x, x <= y and y >= x

We don't care about operand positions, we just want to find any of its previous relation, and which branch of it we are now in.

vsavchenko added a comment.EditedMay 29 2020, 10:34 AM

Great job! Thank you for your work!

I would say, fixing a couple of minor style issues and testing it on a few more projects, and it is ready to land!

Another point is that I don't know how to get printed stats from the scan-build.

That option in scan-build is called -internal-stats.

I completely agree with you. But, unfortunately, vim-proj is the only I could squeeze from that bunch.

My opinion here is that any projects (with a good amount of users or stars on GitHub ^-^ ) will do. If it is hard to build them on Windows, move to the next project. I guess projects with a minimal number of dependencies should be fairly easy on any platform.

clang/lib/StaticAnalyzer/Core/RangeConstraintManager.cpp
80

I agree that it is a mess, but if we want to reduce it with time, I'd say we should stick to the official style guide.

689

Yep, sounds good!

696

I was thinking about adding more context to the name, something like RelatedRangeSet or similar. To show not only what it is (RangeSet) and how we got it (found), but also its purpose or meaning.

700–705

Oh, I see now. I missed the part where RHS is on the left now :-)

Updated due to comments.

ASDenysPetrov added a comment.EditedJun 11 2020, 5:18 AM

@vsavchenko I've compiled some performance stats using csa-testbench. The result are within the margin of error.

ProjectBeforeAfterDelta
TMUX0:01:130:01:130,00%
Redis0:02:370:02:391,27%
Vim0:03:130:03:140,52%
Xerces0:04:140:04:171,18%
OpenSSL0:04:250:04:260,38%
protobuf0:05:520:05:550,85%
SQLite0:07:150:07:344,37%
PostgreSQL0:08:270:08:371,97%
LLVM1:06:451:06:15-0,75%
TOTAL1:44:011:44:100,14%

Upd: replaced a bit ill-collected statistics with more relevant one.

vsavchenko accepted this revision.Jun 11 2020, 5:48 AM

@vsavchenko I've compiled some performance stats using csa-testbench. The result are within the margin of error.

Awesome! Thank you for your work!

This revision is now accepted and ready to land.Jun 11 2020, 5:48 AM

@xazax.hun,
I've made performance measurements you concerned about. Could you look at it and make decision on this patch?

xazax.hun accepted this revision.Jun 12 2020, 9:26 AM

I would not call the results of the measurement within the margin of error but the results do not look bad. Unless there is some objection from someone else I am ok with committing this but please change the title of revision before committing. When one says optimization we often think about performance. It should say something like reasoning about more comparison operations.

ASDenysPetrov added a comment.EditedJun 15 2020, 2:40 AM

I would not call the results of the measurement within the margin of error but the results do not look bad. Unless there is some objection from someone else I am ok with committing this but please change the title of revision before committing. When one says optimization we often think about performance. It should say something like reasoning about more comparison operations.

Thanks. I'll change the name.
I mentioned it as a "margin of error" because if you run tests on the same build twice you can easily get the similar difference between two tests in a row.

ASDenysPetrov retitled this revision from [analyzer] RangeConstraintManager optimizations in comparison expressions to [analyzer] Reasoning about comparison expressions in RangeConstraintManager.Jun 15 2020, 2:42 AM
This revision was automatically updated to reflect the committed changes.