This is an archive of the discontinued LLVM Phabricator instance.

[analyzer][WIP] Suppress bug reports where a tracked expression's latest value change was a result of an invalidation
AbandonedPublic

Authored by Szelethus on Mar 5 2020, 10:07 AM.

Details

Summary

You may be familiar with reports that are demonstrated in the test file -- While invalidation is a fundamental part of static analysis, it is unfortunately not an under-approximation (resulting in fewer but more precise paths of execution). Invalidation are often incorrect, so this patch attempts to suppress reports where a tracked expression's last value change was a result of an invalidation.

This is solved by a adding a new checker that notes which regions at which ProgramPoint were invalidated, and adding a new BugReporterVisitor to the army of trackExpressionValue related visitors that looks for the last value change, and suppresses the report if it was a previously noted invalidation.

The big scare in such changes is always whether we would suppress so many things, so this patch says little without results. I'll share some as soon as I have them :)

Diff Detail

Event Timeline

Szelethus created this revision.Mar 5 2020, 10:07 AM
NoQ added a comment.EditedMar 5 2020, 11:50 AM

In my head this patch should ideally be reduced to a single if-statement: "This value is a SymbolDerived therefore it was produced by invalidation".

It's harder than that, of course, because some derived symbols are legitimate (i.e., values returned through out-parameters). But this is a separate problem: instead of producing anonymous conjured symbols, invalidation should produce richer symbols that explain what kind of invalidation has happened and which specific effect of this invalidation is represented by this symbol. Or we could instead make better symbols for representing out-parameters.

Szelethus marked an inline comment as done.Mar 5 2020, 12:11 PM
In D75698#1908335, @NoQ wrote:

In my head this patch should ideally be reduced to a single if-statement: "This value is a SymbolDerived therefore it was produced by invalidation".

Getting to this point already forced me to learn a lot about the inner workings of the analyzer, and I suspect I have quite a bit of work ahead of me still. Infact, I'm not even sure what SymbolDerived is. I suspect this approach should be abandoned then?

It's harder than that, of course, because some derived symbols are legitimate (i.e., values returned through out-parameters). But this is a separate problem: instead of producing anonymous conjured symbols, invalidation should produce richer symbols that explain what kind of invalidation has happened and which specific effect of this invalidation is represented by this symbol. Or we could instead make better symbols for representing out-parameters.

Alright, I've got some learning to then :) Thanks!

NoQ added a comment.Mar 5 2020, 12:14 PM

Like, the only reason why we have trackExpressionValue is because our concrete values are indistinguishable from each other. Regardless of where it comes from, 0 (Loc) is the same as all other 0 (Loc)s, so we have to manually observe how it was copied around in order to figure out where it comes from.

Symbols, by design, have their origin as their identity. Whenever we encounter an unknown runtime value, we denote it with a symbol, therefore the symbol by construction has all the information about that runtime value: how it appeared, why it's unknown, and so on.

Because invalidation produces symbols and not concrete values (as it destroys information about the program state, while concrete values are such information), this patch doesn't need to have anything to do with value tracking.

NoQ added a comment.Mar 5 2020, 12:18 PM
In D75698#1908335, @NoQ wrote:

In my head this patch should ideally be reduced to a single if-statement: "This value is a SymbolDerived therefore it was produced by invalidation".

Getting to this point already forced me to learn a lot about the inner workings of the analyzer, and I suspect I have quite a bit of work ahead of me still. Infact, I'm not even sure what SymbolDerived is.

It's a chunk of a bigger symbol. Like, if we already have a symbol for an unknown value of a structure written in a certain region, then the symbol for a field of this structure would be a derived symbol from that first symbol for the field region.

Now, the only reason we have symbols for structures (or sometimes even for entire memory spaces) is because of invalidation. What remains is to tweak the patch in order to account for different kinds of invalidation.

Szelethus abandoned this revision.Mar 5 2020, 12:22 PM

This sounds exciting! Back to work then.

While invalidation is a fundamental part of static analysis, it is unfortunately not an under-approximation (resulting in fewer but more precise paths of execution)

+1 for saying this out :)

NoQ added a comment.Mar 10 2020, 12:44 AM

While invalidation is a fundamental part of static analysis, it is unfortunately not an under-approximation (resulting in fewer but more precise paths of execution)

+1 for saying this out :)

It would have been a correct under-approximation assuming that other parts of the static analyzer were working correctly. The reason why invalidation causes false positives is because we're making state splits on every if-statement and never join back, which causes us to explore more execution paths than we're entitled to by the presumption of no dead code. Eg.,

class C {
  bool x;

  void do_not_change_x();

  void foo() {
    if (x) { // assuming 'x' is true
      do_not_change_x(); // 'x' is invalidated
    }

    // We are not entitled to a state split here
    // over the invalidated value of 'x'.
    if (x) {
      ...
    }
  }
};

Normally such eager state splits don't cause any problems because we tell the users to "simply add an assert, the code is obscure anyway". But when invalidation kicks in, like in this example, such asserts become very ugly. I.e., in this case you'd have to do something like this to suppress the warning and you'll have to do it every time you call do_not_change_x() for every variable that it doesn't change:

bool old_x = x;
do_not_change_x();
assert(x == old_x && "x changed!");
// Also suppress unused variable warning in release builds.
(void)old_x;

P.S. So, like, we could try to emit the warning only if we covered enough execution paths to prove that there's either dead code or the warning is true. Then we would no longer care about invalidation problems. Unfortunately, i don't have any specific suggestion of how to prove such facts for an arbitrary CFG.

P.P.S. Actually you know what, maybe we should only drop the report if the constraint over the invalidated value contradicts the constraint over the old value. That'll make things a bit more complicated and will require a visitor indeed, though hopefully not as complicated as concrete value tracking, as we're still interested in only one region at a time.

NoQ added inline comments.Mar 10 2020, 12:45 AM
clang/test/Analysis/suppress-invalidation-related-reports.cpp
18

In particular, i believe this is a true positive. Because either flag may be set to false by foo() which makes the path feasible, or the if-statement is dead code.

P.S. So, like, we could try to emit the warning only if we covered enough execution paths to prove that there's either dead code or the warning is true. Then we would no longer care about invalidation problems. Unfortunately, i don't have any specific suggestion of how to prove such facts for an arbitrary CFG.

If I understand you correctly, this would mean that we have to reason about all possible execution paths at the same time to do this. Actually, that would be possible only with some kind of a fix-point flow-analysis and clearly the symbolic execution we have in CSA is a completely different beast (it reasons about one path where there is a bug).

P.P.S. Actually you know what, maybe we should only drop the report if the constraint over the invalidated value contradicts the constraint over the old value. That'll make things a bit more complicated and will require a visitor indeed, though hopefully not as complicated as concrete value tracking, as we're still interested in only one region at a time.

How would that be different than proving the feasibility of the path with Z3? Could we reuse Mikhail's work here, or that would be overkill for this task?

NoQ added a comment.Mar 10 2020, 6:02 AM

P.S. So, like, we could try to emit the warning only if we covered enough execution paths to prove that there's either dead code or the warning is true. Then we would no longer care about invalidation problems. Unfortunately, i don't have any specific suggestion of how to prove such facts for an arbitrary CFG.

If I understand you correctly, this would mean that we have to reason about all possible execution paths at the same time to do this. Actually, that would be possible only with some kind of a fix-point flow-analysis and clearly the symbolic execution we have in CSA is a completely different beast (it reasons about one path where there is a bug).

Nah, we just need to cover *enough* paths, not *all* paths. Just take all paths on which we've managed to find a particular bug (we keep track of these paths during de-duplicating anyway) and somehow figure out that it's "enough" paths. The point is to prove that there's dead code if all of these paths are infeasible. I.e., it may require an auxiliary CFG-based analysis that'll reason about all possible execution paths, but it doesn't require us to have much interaction between that analysis and our usual symbolic execution. We already have such auxiliary analysis, eg. see how dead symbol elimination is an all-paths problem and RelaxedLiveVariables analysis is helping us out with it.

See also my old little essay on this subject. I was talking about it in the context of path note generation but the problem is roughly the same here.

I know this is vague but i spent some time trying to come up with an actual solution and failed so far, so take it with a grain of salt^^

P.P.S. Actually you know what, maybe we should only drop the report if the constraint over the invalidated value contradicts the constraint over the old value. That'll make things a bit more complicated and will require a visitor indeed, though hopefully not as complicated as concrete value tracking, as we're still interested in only one region at a time.

How would that be different than proving the feasibility of the path with Z3? Could we reuse Mikhail's work here, or that would be overkill for this task?

This problem is fairly orthogonal to constraint solving. It's about providing constraints, not solving them. Like, we have two different values that we previously thought of as separate, now we're telling the solver that they're, emm, maybe slightly less separate, "hey dude, actually, on second thought, can you please take a look at what would have happened if they were actually equal?". We still don't care how exactly does the solver solve this.