This is an archive of the discontinued LLVM Phabricator instance.

[analyzer] Replace the vector of ConstraintSets by a single ConstraintSet and a function to merge ConstraintSets
ClosedPublic

Authored by mikhail.ramalho on Jun 25 2018, 1:27 PM.

Details

Summary

The previous approach would add every constraint, from every node, into a vector and encode them all at the end of the path. However, duplicated constraints were not being removed and the SMT formula generated by the refutation manager would contain a huge number of duplicated constraint.

This patch implements some code to merge the constraint over the same symbol, removing duplicates. These are the number before and after the patch:

Project    |  before  |   after  |
tmux       |  283.222 |  137.927 | 
redis      |  614.858 |  457.721 |
openssl    |  308.292 |  317.049 |
twin       |  274.478 |  266.339 |
git        |  547.687 |  528.171 |
postgresql | 2927.495 | 2167.802 |
sqlite3    | 3264.305 | 1392.721 |

Major speedups in tmux and sqlite (less than half of the time), redis and postgresql were about 25% faster while the rest are basically the same.

Diff Detail

Repository
rC Clang

Event Timeline

xazax.hun added a comment.EditedJun 26 2018, 12:27 AM

Ok, it looks like naively just dropping all the constraints at Z3 is not the most efficient way.

I have a question about this approach though. Right now this approach is intersecting the ranges associated with symbols and symbolic expressions between the new and the old state.
I have the feeling, however, this is redundant work. This is exactly what the analyzer is doing during the analysis, and the information should be already there in the exploded graph.

The original refutation patch used a logic that it added the ranges of a symbolic expression exactly once, where the analyzer has the most information about that expression (when the particular symbol is became dead or it is in the last node).
Using that approach we do not need to compare ranges or intersect them, for this reason, I did find that solution simpler.

So one approach is to merge everything the other is to just pick the right ranges from the right places.

Is there any particular reason that you follow a different path? If yes, I would love to see that documented, to make sure nobody will introdue an optimization that does not work in the future.

The original refutation patch used a logic that it added the ranges of a symbolic expression exactly once, where the analyzer has the most information about that expression

I did not like that approach as it creates tight coupling between the visitor and the analyzer internals which are not normally exposed.
I'm not sure it was correct, as after switching away from that we have noticed that some bugs have disappeared (though IIRC we do not fully understand why).
I think the current approach is actually much cleaner.

george.karpenkov requested changes to this revision.Jun 26 2018, 5:42 PM

Nits: would prefer the visitor in a new file now, clarifications on the merging algorithm required.

lib/StaticAnalyzer/Core/BugReporterVisitors.cpp
2368

This visitor starts to get big, and the logic is quite different to what is usually found in BugReporterVisitors. Could we separate it into a different file with it's own header, which would be included by BugReporter.cpp?

2408

I'm not sure what happens in this branch. If the intersection is empty, seems like the state should be infeasible? Should this branch be in fact marked unreachable, since the state should not exist otherwise?

2412

I am quite confused on why do we need both ResultRange and R and why do we assign from one to the other. Maybe this would be fixed if the branch above is marked unreachable?

This revision now requires changes to proceed.Jun 26 2018, 5:42 PM
NoQ added a comment.Jun 26 2018, 6:15 PM

I think @xazax.hun has a point. We should not be intersecting any ranges because older ranges (that are closer to the root of the graph) are always super-sets of the newer ranges. Essentially, for every symbol we need to take either the final range (if it's present in the last program state) or the latest range (from the last state in which it's present) and feed it to the solver. That's all. No intersections. No PreStmtPurgeDeadSymbols program points. That's a fairly normal visitor workflow - that's exactly we have a pair of nodes as arguments to VisitNode. Just observe how state changes.

mikhail.ramalho retitled this revision from [Analyzer] Replace the vector of ConstraintSets by a single ConstraintSet and a function to merge ConstraintSets to [analyzer] Replace the vector of ConstraintSets by a single ConstraintSet and a function to merge ConstraintSets.Jun 27 2018, 6:49 AM

In any case, I think @NoQ and @xazax.hun have a valid point here, the CSA already merges the ranges so this patch feels like duplicated work.

I can have another look at @rnkovacs's patch and try to improve the approach here. What do you think, @george.karpenkov?

lib/StaticAnalyzer/Core/BugReporterVisitors.cpp
2368

Sure.

2408

This is the case where we have an old set:

x = [1,2], [10,20]

and the new one is:

x = [3,4]

When intersecting the [1,2] with [3,4] the range will be empty but both ranges are valid.

lib/StaticAnalyzer/Core/BugReporterVisitors.cpp
2408

I am confused, when and why would that happen?

Implemented the suggested approach (kinda). Instead of adding the constraints when they are removed, this patch adds them when they first appear and, since we walk the bug report backward, it should be the last set of ranges generated by the CSA for a given symbol.

The numbers:

Project    |  current |    v1    |    v2    | 
tmux       |  283.222 |  137.927 |  123.052 |
redis      |  614.858 |  457.721 |  400.347 |
openssl    |  308.292 |  317.049 |  307.149 |
twin       |  274.478 |  266.339 |  245.411 |
git        |  547.687 |  528.171 |  477.335 |
postgresql | 2927.495 | 2167.802 | 2002.526 |
sqlite3    | 3264.305 | 1392.721 | 1028.416 |

current: current version in repo
v1: first version of this patch, which merges the ranges
v2: second version of this patch, which adds the constraints when they first appear.

Overall, v2 is even faster than v1 and the current version across all projects.

@mikhail.ramalho LGTM with a few nits.

I assume you've double checked that the results are the same as what you get without this optimization?

lib/StaticAnalyzer/Core/BugReporterVisitors.cpp
2386

nit: cleaner to just call VisitNode(EndPathNode, nullptr, BRC, BR)

2391

Nit: could drop braces according to LLVM style guide

This revision is now accepted and ready to land.Jun 29 2018, 10:06 AM

@mikhail.ramalho LGTM with a few nits.

I assume you've double checked that the results are the same as what you get without this optimization?

Yes, all the regressions pass and the same bugs are removed from the projects, with and without this patch.

Small requested changes.

This revision was automatically updated to reflect the committed changes.