This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
clang/
-
include/clang/StaticAnalyzer/Core/BugReporter/
-
clang/
-
StaticAnalyzer/
-
Core/
-
BugReporter/
14/22
BugReporterVisitors.h
-
lib/StaticAnalyzer/Core/
-
StaticAnalyzer/
-
Core/
-
BugReporterVisitors.cpp

Differential D103605

[analyzer] Introduce a new interface for tracking
ClosedPublic

Authored by vsavchenko on Jun 3 2021, 3:11 AM.

Download Raw Diff

Details

Reviewers

NoQ
xazax.hun
martong
steakhal
Szelethus
manas
RedDocMD

Commits

rG0cc3100bf8d1: [analyzer] Introduce a new interface for tracking

Summary

Tracking values through expressions and the stores is fundamental
for producing clear diagnostics. However, the main components
participating in this process, namely trackExpressionValue and
FindLastStoreBRVisitor, became pretty bloated. They have an
interesting dynamic between them (and some other visitors) that
one might call a "chain reaction". trackExpressionValue adds
FindLastStoreBRVisitor, and the latter calls trackExpressionValue.

Because of this design, individual checkers couldn't affect what's
going to happen somewhere in the middle of that chain. Whether they
want to produce a more informative note or keep the overall tracking
going by utilizing some of the domain expertise. This all lead to two
biggest problems that I see:

Some checkers don't use it

This should probably never be the case for path-sensitive checks.

Some checkers incorporated their logic directly into those components

This doesn't make the maintenance easier, breaks multiple
architecture principles, and makes the code harder to read adn
understand, thus, increasing the probability of the first case.

This commit introduces a prototype for a new interface that will be
responsible for tracking. My main idea here was to make operations
that I want have as a checker developer easy to implement and hook
directly into the tracking process.

After that I intend to make the following changes:

Put all of the trackExpressionValue into one handler and replace calls to trackExpressionValue to Tracker::track
Change visitors that call trackExpressionValue to have a parent Tracker, so they re-use the actual tracker that spawned them.
Split trackExpressionValue implementation (at this point some big handler) into smaller handlers based on their semantics.
Refactor all of the note producing code out of the FindLastStoreBRVisitor into a big clunky StoreHandler and call Tracker::handle instead.
Split that one handler into smaller handlers based on their semantics.
Extract certain handlers into checkers

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

vsavchenko created this revision.Jun 3 2021, 3:11 AM

Herald added subscribers: ASDenysPetrov, dkrupp, donat.nagy and 5 others. · View Herald TranscriptJun 3 2021, 3:11 AM

vsavchenko requested review of this revision.Jun 3 2021, 3:11 AM

Herald added a project: Restricted Project. · View Herald TranscriptJun 3 2021, 3:11 AM

Herald added a subscriber: cfe-commits. · View Herald Transcript

vsavchenko edited the summary of this revision. (Show Details)Jun 3 2021, 3:13 AM

vsavchenko edited the summary of this revision. (Show Details)Jun 3 2021, 3:20 AM

Add docstring for StoreHandler

vsavchenko mentioned this in D103434: [analyzer] Allow visitors to run callbacks on completion.Jun 3 2021, 3:36 AM

Harbormaster completed remote builds in B107428: Diff 349511.Jun 3 2021, 4:04 AM

Tweak some parts of the interface

Great initiative! You haven't mention markInteresting functions, but I think it would be really important to incorporate that functionality here as well. IMHO, markInteresting shouldn't have been part of the public API ever, Checkers should have been calling only the trackExpressionValue (as internally that calls markInteresting anyway).

clang/include/clang/StaticAnalyzer/Core/BugReporter/BugReporterVisitors.h
113	I think we should document this class' responsibility with a terse sentence.
231	This naming is a bit too bloated to me, what do you think about this: enum class Position { Front, Back }; addExpressionHandler(Front, ...) {

In D103605#2796111, @martong wrote:

Great initiative! You haven't mention markInteresting functions, but I think it would be really important to incorporate that functionality here as well. IMHO, markInteresting shouldn't have been part of the public API ever, Checkers should have been calling only the trackExpressionValue (as internally that calls markInteresting anyway).

Oh, yes, interesting-ness doesn't provide a very clear interface for the feedback loop that I'm trying to adjust. From the perspective of the use "mark interesting whatever your checker is interested in and the core will figure it out", it's fine. It has a clear message and you simply expect the engine to do its magic. The problem comes when you try to catch something interesting from your own checker-specific visitor. In that use case, you, first of all, really need to understand how trackers traverse the graph and what they might consider interesting, and rely on the fact that you will get the right interesting thingy. So, the predicate checking for interesting-ness available to the users, is not very helpful IMO.

clang/include/clang/StaticAnalyzer/Core/BugReporter/BugReporterVisitors.h
113	Sure
231	Maybe `addHighPriorityHandler` and `addLowPriorityHandler`? There is probably no reason to put `Expression` and `Store` into the actual name.

Harbormaster completed remote builds in B107440: Diff 349529.Jun 3 2021, 6:45 AM

Address review comments

vsavchenko marked an inline comment as done.Jun 3 2021, 6:46 AM

vsavchenko added inline comments.

clang/include/clang/StaticAnalyzer/Core/BugReporter/BugReporterVisitors.h
231	What do you think about this solution?

vsavchenko added a child revision: D103616: [analyzer] Reimplement trackExpressionValue as ExpressionHandler.Jun 3 2021, 6:52 AM

Harbormaster completed remote builds in B107444: Diff 349537.Jun 3 2021, 7:28 AM

Minor change

Harbormaster completed remote builds in B107451: Diff 349548.Jun 3 2021, 8:15 AM

In D103605#2796141, @vsavchenko wrote:

In D103605#2796111, @martong wrote:

Great initiative! You haven't mention markInteresting functions, but I think it would be really important to incorporate that functionality here as well. IMHO, markInteresting shouldn't have been part of the public API ever, Checkers should have been calling only the trackExpressionValue (as internally that calls markInteresting anyway).

Oh, yes, interesting-ness doesn't provide a very clear interface for the feedback loop that I'm trying to adjust. From the perspective of the use "mark interesting whatever your checker is interested in and the core will figure it out", it's fine. It has a clear message and you simply expect the engine to do its magic. The problem comes when you try to catch something interesting from your own checker-specific visitor. In that use case, you, first of all, really need to understand how trackers traverse the graph and what they might consider interesting, and rely on the fact that you will get the right interesting thingy. So, the predicate checking for interesting-ness available to the users, is not very helpful IMO.

What I mean is that probably we should have the following function(s) provided for the user:

Result track(SVal InterestingSVal, ....);
Result track(const MemRegion* InterestingRegion, ....);
Result track(SymbolRef InterestingSym, ....);

Or even better, only track(SVal InterestingSVal, ....) would be enough to be public. Then we could deprecate markInteresting and all Checkers should use your new Tracker interface. As I see, both trackExpressionValue and markInteresting serves the same purpose: i.e. to do a backward slicing in the ExplodedGraph to select interesing/important nodes that should be presented to the user.

clang/include/clang/StaticAnalyzer/Core/BugReporter/BugReporterVisitors.h
231	Sounds good.

In D103605#2796425, @martong wrote:
In D103605#2796141, @vsavchenko wrote:

In D103605#2796111, @martong wrote:

Great initiative! You haven't mention markInteresting functions, but I think it would be really important to incorporate that functionality here as well. IMHO, markInteresting shouldn't have been part of the public API ever, Checkers should have been calling only the trackExpressionValue (as internally that calls markInteresting anyway).

Oh, yes, interesting-ness doesn't provide a very clear interface for the feedback loop that I'm trying to adjust. From the perspective of the use "mark interesting whatever your checker is interested in and the core will figure it out", it's fine. It has a clear message and you simply expect the engine to do its magic. The problem comes when you try to catch something interesting from your own checker-specific visitor. In that use case, you, first of all, really need to understand how trackers traverse the graph and what they might consider interesting, and rely on the fact that you will get the right interesting thingy. So, the predicate checking for interesting-ness available to the users, is not very helpful IMO.

What I mean is that probably we should have the following function(s) provided for the user:
Result track(SVal InterestingSVal, ....);
Result track(const MemRegion* InterestingRegion, ....);
Result track(SymbolRef InterestingSym, ....);
Or even better, only track(SVal InterestingSVal, ....) would be enough to be public. Then we could deprecate markInteresting and all Checkers should use your new Tracker interface. As I see, both trackExpressionValue and markInteresting serves the same purpose: i.e. to do a backward slicing in the ExplodedGraph to select interesing/important nodes that should be presented to the user.

We need to define semantics for those a bit better I think. Not that trackExpressionValue or FindLastStoreBRVisitor are defined very well! But I want to approach it incrementally and first of all make those two extendable.
But of course, whatever tracks something (want to perform backward slicing) should probably go through the tracker in the future.

Add forgotten piece of StoreInfo

Harbormaster completed remote builds in B107528: Diff 349650.Jun 3 2021, 2:00 PM

Ok. Oof. Whoa. That's amazing.

Let me re-iterate to make sure that my understanding of this patchset is correct.

Chapter 1. "A is correlated with B (ρ = 0.56), given C, assuming D and under E conditions"

Two new entities are introduced: trackers and handlers.
Trackers are attached to bug reports. A single bug report may have multiple trackers.
- Why not have only one tracker per report?
Handlers are attached to trackers. Each tracker can have multiple handlers with a priority system.
- Do you envision it growing into a dependency system later?
A subclass of visitors known as "tracking visitors" are connected to trackers. One tracker may have multiple visitors attached to it.
Trackers are a tiny, almost uncustomizable class. It contains is a method called track() that is invoked when attention of handlers is required.
Each tracker comes with a default set of handlers.
The only way to customize a tracker is to add extra handlers to it.
Handlers implement a handle() method invoked by the tracker every time their attention is required in track().
Handlers may return a note from their handle() method. If a handler returns a note, this note is attached to the bug report and lower-priority handlers are not invoked at all.
There are two major classes of handlers: expression handlers and store handlers. Expression handlers pay attention to Environment entries, store handlers pay attention to Region Store bindings.

Chapter 2. "For immediate release: Scientists find potential link between A and B"

Bug visitors are still responsible for discovering connections across exploded nodes. On the contrary, track()/handle() workflow is instantaneous.
- It sounds like this whole patchset is orthogonal to reimplementing value tracking visitors with note tags, as long as we can pass a reference to the appropriate tracker into the tag(?)
Tracking starts when a checker creates a tracker and invokes track() on it, which instantly causes the respective handle() to be invoked.
handle() may attach visitors. These visitors may in turn call track() as they reach a different exploded node.
The termination condition is as usual: no further visitors attached.
- It sounds like this patchset is orthogonal to the problem of detecting the tracking termination point(?) It does make it easier to implement customized reactions to such termination though.

Chapter 3. "A causes B all the time. What will this means for Obama?"

trackExpressionValue() now boils down to creation of a single tracker and an initial invocation of track() on the expression which in turn causes expression handlers to immediately react.
Most of the old body of trackExpressionValue() is transformed into a default expression handler. So it's still part of an immediate reaction, just a bit more generalized.
Most of the visitors that previously composed trackExpressionValue() are still part of the process; they're attached by the default expression handler.
However, instead of adding notes directly, they simply invoke track(). All note-attaching work is moved into additional default handlers.

Chapter 4. "I'm wearing this to ward off A!"

Basic usage in checkers is basically unchanged. They still simply call trackExpressionValue() and it just works.
Advanced usage - which is the whole point of this patchset - probably implies adding custom handlers to the tracker created by trackExpressionValue().
Let's consider an example. Suppose you're implementing evalCall for llvm::dyn_cast. You could have your expression handler detect a modeled dyn_cast call and implement its handle() method to emit a checker-specific note and start tracking the argument.
With this implementation such note will only be emitted if the value is tracked back to dyn_cast. Not every modeled dyn_cast will receive a note but only the one that, say, caused a null dereference on its return value when it failed.
- Now, how does this interact with note tags? The idea behind note tags is so that visitors didn't need to reverse-engineer the analysis.
- It sounds like the handler is a step back to the visitor world: it doesn't know how exactly was dyn_cast modeled during analysis, so it has to reverse-engineer what happened.
- For example, in order to find out whether dyn_cast succeeded or failed, it has to explore the program state to find out which values are passed into the cast and returned from the cast.
- Whereas a note tag would have this information accessible directly.
  - Hence a suggestion. Maybe instead of having a system of handlers, allow visitors to ask the tracker whether track() was invoked at this node on a certain expression or region?
  - I.e., polling instead of callbacks.
  - This would naturally integrate with note tags.
  - This does not solve the priority problem. We could try to solve it separately by, say, enforcing at most one note per exploded node, and then implementing a similar priority system over visitors.
  - Or am I missing something about your solution that makes connection to note tags a non-issue?

In D103605#2798171, @NoQ wrote:

Ok. Oof. Whoa. That's amazing.

Let me re-iterate to make sure that my understanding of this patchset is correct.

Chapter 1. "A is correlated with B (ρ = 0.56), given C, assuming D and under E conditions"

Two new entities are introduced: trackers and handlers.

Correct.

Trackers are attached to bug reports. A single bug report may have multiple trackers.

Why not have only one tracker per report?

That's a good question, we can actually even make it a part of BugReport for that matter. I can though think of one exotic case, where you wait till you encounter certain weird condition, where you need to spawn a very specific customized tracker only starting from that point.

Handlers are attached to trackers. Each tracker can have multiple handlers with a priority system.

Do you envision it growing into a dependency system later?

Yes, I think in order to correctly split some parts of the logic that already exists in trackExpressionValue

A subclass of visitors known as "tracking visitors" are connected to trackers. One tracker may have multiple visitors attached to it.

Correct.

Trackers are a tiny, almost uncustomizable class. It contains is a method called track() that is invoked when attention of handlers is required.

Correct. If you really want to, you can change the default behavior of those track functions.

Each tracker comes with a default set of handlers.

Correct. This is how we migrate existing functionality into the new interface.

The only way to customize a tracker is to add extra handlers to it.

It's not the only one, you can subclass it and add completely different logic. But the main way for the majority of use cases is indeed through handlers.

Handlers implement a handle() method invoked by the tracker every time their attention is required in track().

Correct.

Handlers may return a note from their handle() method. If a handler returns a note, this note is attached to the bug report and lower-priority handlers are not invoked at all.

Not entirely correct, this is correct for store handlers, these are intended to customize messages for events of symbolic value being stored into the region.
Expression handlers return Tracker::Result that simply tells if some visitor was added or not (backward compatibility with trackExpressionValue) and if we should prevent all other handlers from running here.

There are two major classes of handlers: expression handlers and store handlers. Expression handlers pay attention to Environment entries, store handlers pay attention to Region Store bindings.

I would say that expression handler is a way to externally extend trackExpressionValue functionality. Add new supported type of expression that we can track or add checker-specific visitor recursively.

Chapter 2. "For immediate release: Scientists find potential link between A and B"

Bug visitors are still responsible for discovering connections across exploded nodes. On the contrary, track()/handle() workflow is instantaneous.

It is tricky because track can add visitors and they can call it back from VisitNode, and so on to infinity. This is a pattern that we had between trackExpressionValue and FindLastBRVisitor (and ReturnVisitor) and every checker can create their own recursive tracking.

It sounds like this whole patchset is orthogonal to reimplementing value tracking visitors with note tags, as long as we can pass a reference to the appropriate tracker into the tag(?)

Correct, it just makes existing back-slicing tracking easier to customize.

Tracking starts when a checker creates a tracker and invokes track() on it, which instantly causes the respective handle() to be invoked.

handle() may attach visitors. These visitors may in turn call track() as they reach a different exploded node.

Exactly.

The termination condition is as usual: no further visitors attached.

It sounds like this patchset is orthogonal to the problem of detecting the tracking termination point(?) It does make it easier to implement customized reactions to such termination though.

Actually, tracker dies when this happens. But hey, can you provide a use case when the checker really needs to know when this happens?

Chapter 3. "A causes B all the time. What will this means for Obama?"

trackExpressionValue() now boils down to creation of a single tracker and an initial invocation of track() on the expression which in turn causes expression handlers to immediately react.

Correct.

Most of the old body of trackExpressionValue() is transformed into a default expression handler. So it's still part of an immediate reaction, just a bit more generalized.

Correct.

Most of the visitors that previously composed trackExpressionValue() are still part of the process; they're attached by the default expression handler.

Correction: "All of the visitors..."

However, instead of adding notes directly, they simply invoke track(). All note-attaching work is moved into additional default handlers.

No, they call track where they used to call trackExpressionValue or add FindLastStoreBRVisitor.
Right now, only stores received their special customization and can produce notes based on tracker customization.

Chapter 4. "I'm wearing this to ward off A!"

Basic usage in checkers is basically unchanged. They still simply call trackExpressionValue() and it just works.

Correct

Advanced usage - which is the whole point of this patchset - probably implies adding custom handlers to the tracker created by trackExpressionValue().

Correct

Let's consider an example. Suppose you're implementing evalCall for llvm::dyn_cast. You could have your expression handler detect a modeled dyn_cast call and implement its handle() method to emit a checker-specific note and start tracking the argument.

So, in here it would look like this: If I want to support dyn_cast for tracking, I will add extra ExpressionHandler that calls track on the expression inside of llvm::dyn_cast. Additionally, if you want to change the note's message for situations like a = dyn_cast<Ty>(b), you can add a StoreHandler that will produce a note like "a is casted from b".

With this implementation such note will only be emitted if the value is tracked back to dyn_cast. Not every modeled dyn_cast will receive a note but only the one that, say, caused a null dereference on its return value when it failed.

Correct.

Now, how does this interact with note tags? The idea behind note tags is so that visitors didn't need to reverse-engineer the analysis.

I mean, it's just an orthogonal mechanism.

It sounds like the handler is a step back to the visitor world: it doesn't know how exactly was dyn_cast modeled during analysis, so it has to reverse-engineer what happened.

Correct, but during the analysis, checker could've modeled a zillion of different dyn_casts, and we don't want to have notes on all of them.

For example, in order to find out whether dyn_cast succeeded or failed, it has to explore the program state to find out which values are passed into the cast and returned from the cast.

Whereas a note tag would have this information accessible directly.

Maybe we can access the note tag from the handler and kill two birds with one stone?

Hence a suggestion. Maybe instead of having a system of handlers, allow visitors to ask the tracker whether track() was invoked at this node on a certain expression or region?

There are still situations when you don't have tags, but want to hook into the tracking process, so you can add your custom feature. Let's say you modeled a system call foo(a, b) and you want to track a and b. Or more generally if you want to help tracking when it's stuck.
Additionally, handlers help to tear existing trackExpressionValue implementation apart.

I.e., polling instead of callbacks.

This would naturally integrate with note tags.

My counter-suggestion is to create a flag on NoteTag (maybe Prunable already can fit into that, I'm not very familiar with this interface) that means "this tag should appear only when part of the tracking trace", and make a handler-visitor pair that will produce notes with messages in those tags.
IMO, it will prevent a ton of checkers repeating very similar code: make visitor that will check if some expression is tracked and generate a note out of tag

This does not solve the priority problem. We could try to solve it separately by, say, enforcing at most one note per exploded node, and then implementing a similar priority system over visitors.

Or am I missing something about your solution that makes connection to note tags a non-issue?

My understanding is that tags don't rule out tracking. The other way around is also true. My take on this - let's make tracking a bit more flexible, it's not gonna hurt anyone.
Another aspect that I think is very important here is the ability to make this change incrementally and to extract some parts of trackExpressionValue and FindLastStoreBRVisitor back into checkers to make it at least a bit easier for the reader.

In D103605#2798171, @NoQ wrote:

Ok. Oof. Whoa. That's amazing.

Let me re-iterate to make sure that my understanding of this patchset is correct.

Chapter 1. "A is correlated with B (ρ = 0.56), given C, assuming D and under E conditions"

Two new entities are introduced: trackers and handlers.

Trackers are attached to bug reports. A single bug report may have multiple trackers.

Why not have only one tracker per report?

Handlers are attached to trackers. Each tracker can have multiple handlers with a priority system.

Do you envision it growing into a dependency system later?

A subclass of visitors known as "tracking visitors" are connected to trackers. One tracker may have multiple visitors attached to it.

Trackers are a tiny, almost uncustomizable class. It contains is a method called track() that is invoked when attention of handlers is required.

Each tracker comes with a default set of handlers.

The only way to customize a tracker is to add extra handlers to it.

Handlers implement a handle() method invoked by the tracker every time their attention is required in track().

Handlers may return a note from their handle() method. If a handler returns a note, this note is attached to the bug report and lower-priority handlers are not invoked at all.

There are two major classes of handlers: expression handlers and store handlers. Expression handlers pay attention to Environment entries, store handlers pay attention to Region Store bindings.

Chapter 2. "For immediate release: Scientists find potential link between A and B"

Bug visitors are still responsible for discovering connections across exploded nodes. On the contrary, track()/handle() workflow is instantaneous.

It sounds like this whole patchset is orthogonal to reimplementing value tracking visitors with note tags, as long as we can pass a reference to the appropriate tracker into the tag(?)

Tracking starts when a checker creates a tracker and invokes track() on it, which instantly causes the respective handle() to be invoked.

handle() may attach visitors. These visitors may in turn call track() as they reach a different exploded node.

The termination condition is as usual: no further visitors attached.

It sounds like this patchset is orthogonal to the problem of detecting the tracking termination point(?) It does make it easier to implement customized reactions to such termination though.

Chapter 3. "A causes B all the time. What will this means for Obama?"

trackExpressionValue() now boils down to creation of a single tracker and an initial invocation of track() on the expression which in turn causes expression handlers to immediately react.

Most of the old body of trackExpressionValue() is transformed into a default expression handler. So it's still part of an immediate reaction, just a bit more generalized.

Most of the visitors that previously composed trackExpressionValue() are still part of the process; they're attached by the default expression handler.

However, instead of adding notes directly, they simply invoke track(). All note-attaching work is moved into additional default handlers.

Chapter 4. "I'm wearing this to ward off A!"

Basic usage in checkers is basically unchanged. They still simply call trackExpressionValue() and it just works.

Advanced usage - which is the whole point of this patchset - probably implies adding custom handlers to the tracker created by trackExpressionValue().

Let's consider an example. Suppose you're implementing evalCall for llvm::dyn_cast. You could have your expression handler detect a modeled dyn_cast call and implement its handle() method to emit a checker-specific note and start tracking the argument.

With this implementation such note will only be emitted if the value is tracked back to dyn_cast. Not every modeled dyn_cast will receive a note but only the one that, say, caused a null dereference on its return value when it failed.

Now, how does this interact with note tags? The idea behind note tags is so that visitors didn't need to reverse-engineer the analysis.

It sounds like the handler is a step back to the visitor world: it doesn't know how exactly was dyn_cast modeled during analysis, so it has to reverse-engineer what happened.

For example, in order to find out whether dyn_cast succeeded or failed, it has to explore the program state to find out which values are passed into the cast and returned from the cast.

Whereas a note tag would have this information accessible directly.

Hence a suggestion. Maybe instead of having a system of handlers, allow visitors to ask the tracker whether track() was invoked at this node on a certain expression or region?

I.e., polling instead of callbacks.

This would naturally integrate with note tags.

This does not solve the priority problem. We could try to solve it separately by, say, enforcing at most one note per exploded node, and then implementing a similar priority system over visitors.

Or am I missing something about your solution that makes connection to note tags a non-issue?

Now this sounds like a plan. Valeriy, this is something that I missed for such a massive refactor...

xazax.hun added inline comments.Jun 4 2021, 12:44 PM

clang/include/clang/StaticAnalyzer/Core/BugReporter/BugReporterVisitors.h
110	I vaguely remember we wanting to put this defensive check suppression in the related checkers. If that is the case, I wonder if this option belongs here.
127	There are some other places were we might have effects that are very similar to stores, like invalidations during conservative function evaluation.
147	During path sensitive analysis we already have a callback for stores. We kind of replicating this logic for bug paths. So my questions are: Do we expect to have additional information here that were not available during the analysis earlier? Do we want to make this as similar to the forward analysis part as possible for developer familiarity?
210	Not directly related to this patch, but I wonder if we want to have unknown values with identities at some point, so we can track them.

In D103605#2798581, @vsavchenko wrote:

In D103605#2798171, @NoQ wrote:

Trackers are attached to bug reports. A single bug report may have multiple trackers.

Why not have only one tracker per report?

That's a good question, we can actually even make it a part of BugReport for that matter. I can though think of one exotic case, where you wait till you encounter certain weird condition, where you need to spawn a very specific customized tracker only starting from that point.

Ok, let's take a simpler example:

int foo(int D) {
  int A = 1;
  int B = 1;
  int C = B - A;
  return D / C; // division by zero
}

And suppose we want to track A and B so that to explain why we think that C is zero. What are pros and cons of:
(1) tracking A and B within the same tracker that already helped us reach C?
(2) terminating the current tracker as it reaches C and spawning two separate trackers for A and B?

The termination condition is as usual: no further visitors attached.

It sounds like this patchset is orthogonal to the problem of detecting the tracking termination point(?) It does make it easier to implement customized reactions to such termination though.

Actually, tracker dies when this happens. But hey, can you provide a use case when the checker really needs to know when this happens?

Inlined defensive check suppressions are a major example. Basically null dereference warnings are suppressed based on properties of the program point at which the tracker terminates. If we know when this happens, we can refactor inlined defensive check suppressions into a checker specific handler.

However, instead of adding notes directly, they simply invoke track(). All note-attaching work is moved into additional default handlers.

No, they call track where they used to call trackExpressionValue or add FindLastStoreBRVisitor.
Right now, only stores received their special customization and can produce notes based on tracker customization.

What's the ultimate plan though, do we want to fully move all notes into handlers?

Let's consider an example. Suppose you're implementing evalCall for llvm::dyn_cast. You could have your expression handler detect a modeled dyn_cast call and implement its handle() method to emit a checker-specific note and start tracking the argument.

So, in here it would look like this: If I want to support dyn_cast for tracking, I will add extra ExpressionHandler that calls track on the expression inside of llvm::dyn_cast. Additionally, if you want to change the note's message for situations like a = dyn_cast<Ty>(b), you can add a StoreHandler that will produce a note like "a is casted from b".

With this implementation such note will only be emitted if the value is tracked back to dyn_cast. Not every modeled dyn_cast will receive a note but only the one that, say, caused a null dereference on its return value when it failed.

Correct.

Now, how does this interact with note tags? The idea behind note tags is so that visitors didn't need to reverse-engineer the analysis.

I mean, it's just an orthogonal mechanism.

It sounds like the handler is a step back to the visitor world: it doesn't know how exactly was dyn_cast modeled during analysis, so it has to reverse-engineer what happened.

Correct, but during the analysis, checker could've modeled a zillion of different dyn_casts, and we don't want to have notes on all of them.

For example, in order to find out whether dyn_cast succeeded or failed, it has to explore the program state to find out which values are passed into the cast and returned from the cast.

Whereas a note tag would have this information accessible directly.

Maybe we can access the note tag from the handler and kill two birds with one stone?

If we can, one way or another, I believe we absolutely should kill two birds with one stone. We shouldn't have two mutually exclusive solutions for two aspects of the same problem. In order to produce good notes, it's necessary to *both* know what happened during analysis *and* know whether the value is tracked.

Hence a suggestion. Maybe instead of having a system of handlers, allow visitors to ask the tracker whether track() was invoked at this node on a certain expression or region?

There are still situations when you don't have tags, but want to hook into the tracking process, so you can add your custom feature. Let's say you modeled a system call foo(a, b) and you want to track a and b. Or more generally if you want to help tracking when it's stuck.

Yes, so like in my a = dyn_cast<...>(b) example the proposed solution would be like:

if (isTracked(a)) {
  emitNote();
  track(b);
}

Such code can be put into either a visitor or a note tag depending on what you have. It is entirely equivalent to the current prevalent idiom

if (isInteresting(a)) {
  emitNote();
  track(b);
}

except it's asking the right question, with your facility allowing to answer it. This approach is very easy to understand for checker developers, basically adds no extra code (doesn't require people to inherit classes).

Additionally, handlers help to tear existing trackExpressionValue implementation apart.

I.e., polling instead of callbacks.

This would naturally integrate with note tags.

My counter-suggestion is to create a flag on NoteTag (maybe Prunable already can fit into that, I'm not very familiar with this interface) that means "this tag should appear only when part of the tracking trace", and make a handler-visitor pair that will produce notes with messages in those tags.

I think this works too. I like the idea that like note tags are for execution paths, we could make "track tags" for value paths (or, well, "value tracks"? - we should really use this concept more often, it's very effective; this patchset separates these two notions of a path that we were previously confusing).

It sounds a bit more restrictive on what kinds of conditions are allowed. For example, if we have auto [a, b] = foo(c), we may decide to track c either when both a and b are tracked, or when only one of them is tracked; I think both approaches could be reasonable.

In particular, a simple flag (a-la Prunable) is not enough. There needs to be a way to communicate *which* value in the state is of interest. The data structure describing such conditions functionally may potentially grow pretty large and this may be a reason to go imperative but I don't think we'll actually run into this problem soon.

IMO, it will prevent a ton of checkers repeating very similar code: make visitor that will check if some expression is tracked and generate a note out of tag

Yes, absolutely, any solution must eliminate the need for such visitors.

NoQ added inline comments.Jun 4 2021, 1:18 PM

clang/include/clang/StaticAnalyzer/Core/BugReporter/BugReporterVisitors.h
127	Yes, this would be really interesting. Like, we probably don't want to add notes saying "we invalidated the Store so here's why we're now assuming stuff that seemingly contradicts our previous assumptions". But we may want to add suppressions based on this info, eg. "our control flow dependencies are based on invalidation, maybe let's try to find a different path to the same bug?"
210	`UnknownVal` is a stop-gap for situations when we've screwed up so badly we don't even have types anymore. The only thing I'd ever want from them is to disappear :) I guess this could be useful for a debug checker that could explain where does an `UnknownVal` come from. In this case unknown values don't need identities; we can track other values without identities, such as null pointers, just fine.

In D103605#2799853, @NoQ wrote:
In D103605#2798581, @vsavchenko wrote:

In D103605#2798171, @NoQ wrote:

Trackers are attached to bug reports. A single bug report may have multiple trackers.

Why not have only one tracker per report?

That's a good question, we can actually even make it a part of BugReport for that matter. I can though think of one exotic case, where you wait till you encounter certain weird condition, where you need to spawn a very specific customized tracker only starting from that point.

Ok, let's take a simpler example:
int foo(int D) {
  int A = 1;
  int B = 1;
  int C = B - A;
  return D / C; // division by zero
}
And suppose we want to track A and B so that to explain why we think that C is zero. What are pros and cons of:
(1) tracking A and B within the same tracker that already helped us reach C?
(2) terminating the current tracker as it reaches C and spawning two separate trackers for A and B?

The termination condition is as usual: no further visitors attached.

It sounds like this patchset is orthogonal to the problem of detecting the tracking termination point(?) It does make it easier to implement customized reactions to such termination though.

Actually, tracker dies when this happens. But hey, can you provide a use case when the checker really needs to know when this happens?

Inlined defensive check suppressions are a major example. Basically null dereference warnings are suppressed based on properties of the program point at which the tracker terminates. If we know when this happens, we can refactor inlined defensive check suppressions into a checker specific handler.

It seems to me that it is a hard concept to wrap your mind around. And even if I would tell someone how we suppress null pointer inline checks, I wouldn't use "when the tracker stops" as a description of the event. It is very vague and tightly coupled with the implementation of tracker stopping, and it is not connected semantically to the problem we are trying to solve. Correct me if I'm wrong, but better description would be track when we constrained the pointer value to null, and if it in the child stack frame, drop it.
So, maybe instead of very hard to understand "when the tracker stops", we can use "when the constraint changes" or "when the constraint became this". This is more similar to store handlers in their nature, we find a pair of states where some condition stops holding (man, that can have a generic implementation!), describe it, and call the handler.

However, instead of adding notes directly, they simply invoke track(). All note-attaching work is moved into additional default handlers.

No, they call track where they used to call trackExpressionValue or add FindLastStoreBRVisitor.
Right now, only stores received their special customization and can produce notes based on tracker customization.

What's the ultimate plan though, do we want to fully move all notes into handlers?

There was no such plan, honestly. I had one problem in mind: it is hard to detect stores yourself, FindLastStoreBRVisitor is very tightly coupled with trackExpressionValue, and you either put your code in it, or... forget about. Yep, there is no other option!
And it's not unheard of that checkers want to know how the value got into this very region, and it is not unheard of wanting to change messages on those. This component seems to be generic enough in identifying stores and classifying them, but horrible in the way that it gives you a fixed set of notes: leave it or take it.
I don't instantly see how other visitors from that set have necessities to be customized, but we already found one about changing constraints to be useful.

Let's consider an example. Suppose you're implementing evalCall for llvm::dyn_cast. You could have your expression handler detect a modeled dyn_cast call and implement its handle() method to emit a checker-specific note and start tracking the argument.

So, in here it would look like this: If I want to support dyn_cast for tracking, I will add extra ExpressionHandler that calls track on the expression inside of llvm::dyn_cast. Additionally, if you want to change the note's message for situations like a = dyn_cast<Ty>(b), you can add a StoreHandler that will produce a note like "a is casted from b".

With this implementation such note will only be emitted if the value is tracked back to dyn_cast. Not every modeled dyn_cast will receive a note but only the one that, say, caused a null dereference on its return value when it failed.

Correct.

Now, how does this interact with note tags? The idea behind note tags is so that visitors didn't need to reverse-engineer the analysis.

I mean, it's just an orthogonal mechanism.

It sounds like the handler is a step back to the visitor world: it doesn't know how exactly was dyn_cast modeled during analysis, so it has to reverse-engineer what happened.

Correct, but during the analysis, checker could've modeled a zillion of different dyn_casts, and we don't want to have notes on all of them.

For example, in order to find out whether dyn_cast succeeded or failed, it has to explore the program state to find out which values are passed into the cast and returned from the cast.

Whereas a note tag would have this information accessible directly.

Maybe we can access the note tag from the handler and kill two birds with one stone?

If we can, one way or another, I believe we absolutely should kill two birds with one stone. We shouldn't have two mutually exclusive solutions for two aspects of the same problem. In order to produce good notes, it's necessary to *both* know what happened during analysis *and* know whether the value is tracked.

I don't think that isTracked predicate is better than isInteresting. It is yet another way to mark things, and if we go this way, we should fix interesting-ness instead.
I still think that there is a use-case for expression handlers: custom handler -> custom visitor -> track -> custom handler -> custom visitor -> ...
This back-and-forth recursive addition of visitors is impossible right now. If the checker wants to implement their own FindLastStoreBRVisitor, they need to put it into trackExpressionValue.

Hence a suggestion. Maybe instead of having a system of handlers, allow visitors to ask the tracker whether track() was invoked at this node on a certain expression or region?

There are still situations when you don't have tags, but want to hook into the tracking process, so you can add your custom feature. Let's say you modeled a system call foo(a, b) and you want to track a and b. Or more generally if you want to help tracking when it's stuck.

Yes, so like in my a = dyn_cast<...>(b) example the proposed solution would be like:
if (isTracked(a)) {
  emitNote();
  track(b);
}
Such code can be put into either a visitor or a note tag depending on what you have. It is entirely equivalent to the current prevalent idiom
if (isInteresting(a)) {
  emitNote();
  track(b);
}
except it's asking the right question, with your facility allowing to answer it. This approach is very easy to understand for checker developers, basically adds no extra code (doesn't require people to inherit classes).

The existing pattern doesn't work, though. I honestly don't like how this tracking (as a fundamental process) is organized at all. It doesn't really matter what predicate or callback we introduce, it is still vague. It should be way more specific, like "I am tracking this value". Tracking involves a hell-lot of other things that we get "for free", but don't really understand them.
For this reason, "interesting-ness" and "being tracked" have (at least for now) exactly the same property, we get them "for free", but don't really understand. Not that callbacks don't have this problem, but they seem to be more flexible at least.

Additionally, handlers help to tear existing trackExpressionValue implementation apart.

I.e., polling instead of callbacks.

This would naturally integrate with note tags.

My counter-suggestion is to create a flag on NoteTag (maybe Prunable already can fit into that, I'm not very familiar with this interface) that means "this tag should appear only when part of the tracking trace", and make a handler-visitor pair that will produce notes with messages in those tags.

I think this works too. I like the idea that like note tags are for execution paths, we could make "track tags" for value paths (or, well, "value tracks"? - we should really use this concept more often, it's very effective; this patchset separates these two notions of a path that we were previously confusing).

We can sketch a plan for that.

It sounds a bit more restrictive on what kinds of conditions are allowed. For example, if we have auto [a, b] = foo(c), we may decide to track c either when both a and b are tracked, or when only one of them is tracked; I think both approaches could be reasonable.

We need to get these straight before we start changing things.

In particular, a simple flag (a-la Prunable) is not enough. There needs to be a way to communicate *which* value in the state is of interest. The data structure describing such conditions functionally may potentially grow pretty large and this may be a reason to go imperative but I don't think we'll actually run into this problem soon.

IMO, it will prevent a ton of checkers repeating very similar code: make visitor that will check if some expression is tracked and generate a note out of tag

Yes, absolutely, any solution must eliminate the need for such visitors.

I want to add something (because I might seem defensive). I do want to make the world peace, but this series of patchiest is addressing a lot of headache me (and Deep) had because of the way trackExpressionValue is organized.
If you think about it, all these patches just do one thing - extract piece of code into a separate function, class, or abstraction.
Fundamentally it is all the same.

I think we can discuss future plans, but also try to be realistic about the scope of this refactoring/redesign, since I'm trying to morph existing code into something a little better.

clang/include/clang/StaticAnalyzer/Core/BugReporter/BugReporterVisitors.h
127	Sure, nothing actually prevents us from extending this enum (and I actually do this, since I forgot call arguments here). However, at this point this models and tears apart `FindLastStoreBRVisitor`. The main idea here was that many checkers are benefiting from understanding how certain value got bound with a certain region. But it is so intertwined with `trackExpressionValue` that you couldn't do anything like this in your own code. And even simply the code that understands the node where the store happens is hard enough to make it abstract it away.
147	Oof, I actually think that this one is trickier. If the checker modeled some operation (most likely `evalCall`), it can bind a value to the region by itself. And this handler triggers for such events as well because it simply finds the place where the switch was flipped and the region changed its binding.
210	+1 for not caring about `UnknownVal`.

xazax.hun added inline comments.Jun 4 2021, 3:58 PM

clang/include/clang/StaticAnalyzer/Core/BugReporter/BugReporterVisitors.h
147	If the checker modeled some operation (most likely evalCall), it can bind a value to the region by itself. And this handler triggers for such events as well because it simply finds the place where the switch was flipped and the region changed its binding. This is such a great point, I'd like that to be captured as a comment somewhere.

In D103605#2800101, @vsavchenko wrote:

Actually, tracker dies when this happens. But hey, can you provide a use case when the checker really needs to know when this happens?

Inlined defensive check suppressions are a major example. Basically null dereference warnings are suppressed based on properties of the program point at which the tracker terminates. If we know when this happens, we can refactor inlined defensive check suppressions into a checker specific handler.

It seems to me that it is a hard concept to wrap your mind around. And even if I would tell someone how we suppress null pointer inline checks, I wouldn't use "when the tracker stops" as a description of the event. It is very vague and tightly coupled with the implementation of tracker stopping, and it is not connected semantically to the problem we are trying to solve. Correct me if I'm wrong, but better description would be track when we constrained the pointer value to null, and if it in the child stack frame, drop it.
So, maybe instead of very hard to understand "when the tracker stops", we can use "when the constraint changes" or "when the constraint became this". This is more similar to store handlers in their nature, we find a pair of states where some condition stops holding (man, that can have a generic implementation!), describe it, and call the handler.

Inlined defensive check suppressions are indeed pretty ugly; this problem should ideally be solved in a completely different manner.

That said, I strongly disagree that "when the tracker stops" is a hard-to-understand concept. It's an extremely natural and intuitive question to ask:

"Where does the value come from?"

It's fairly reasonable to wonder what happened during analysis that caused us to produce this value for the first time, which then traveled down to our bug point, and under what circumstances did it happen. Not much precise reasoning can be based on this information but it's a very good piece of information to drive imprecise reasoning and heuristics, such as (but not limited to) inlined defensive check suppressions (where it bears a relatively precise meaning that corresponds exactly to "state splits in nested stack frames are intrinsically less reliable as they can't ever be derived from presumption of no dead code").

Another example: a value of a top-level parameter variable is much more likely to be arbitrary than a return value of an unknown function call (if the user wanted pre-conditions they could have added an assert), which in turn is much more likely to be arbitrary than a value loaded from a heap region after store invalidation from unknown function call (we're 99% sure that the function doesn't touch this particular heap region so it's probably just the old value). In this sense, I'm pretty tempted to suppress bug paths where two branch conditions (ultimately going into opposite directions) are based on different values of the same heap variable after different number of invalidations. But I'm certainly against doing so for two different top-level parameter values. You may say that in these cases the information is most likely available from the structure of the symbol (SymbolRegionValue vs. SymbolConjured vs. SymbolDerived) but it only confirms my point: origins of values are so important that we even carry them as part of the value's identity! Direct access to the structure of the symbol for determining origins is also used a few times (eg., see call sites of getOriginRegion()).

I want to add something (because I might seem defensive). I do want to make the world peace, but this series of patchiest is addressing a lot of headache me (and Deep) had because of the way trackExpressionValue is organized.
If you think about it, all these patches just do one thing - extract piece of code into a separate function, class, or abstraction.
Fundamentally it is all the same.

I think we can discuss future plans, but also try to be realistic about the scope of this refactoring/redesign, since I'm trying to morph existing code into something a little better.

What I'm trying to do here is to see a general direction in which we're moving. We have the benefit of being able to predict things. We don't have to be agile; our problem is well-defined from the start and will never really change. There's a bit of flexibility required to support a variety of different future checkers, yes, but that's about it. And then, again, it's also not entirely arbitrary: we have a pretty good amount of data on what our checkers typically do and it's unlikely to expand drastically in the near future. So we don't always need to go for the most flexible solution, but there's often benefit in going for a simpler solution that simply covers our predictable needs and nothing else.

As a historical example of such process, currently bug visitors perform a single pass over the bug report, from bottom to top. This wasn't always this way: originally visitors traversed the bug report multiple times (occasionally 30+ times), adding new visitors along the way, until the set of visitors attached to the bug report reached a fixed point. The original solution was very flexible; it allowed us to backreference values at the bottom in a flash of hindsight and have all visitors restarted gracefully. But in the entirety of our foreseeable circumstances this wasn't necessary because the ultimate purpose of having visitors was to answer the question "why do we think this situation has indeed occured in the program?" by adding notes to the path, and the answer to that question is always above the current point in the path. Such simplification really helped.

This is why i'm trying to understand the entire thing before opting into a system. Especially when it comes to checker APIs for which overengineering is particularly dangerous - and I'm still not sure how exactly the new checker APIs are intended to work after a few pages of discussion. I don't care how they work as much as I care about knowing how they are ultimately intended to work and whether it's simple enough to understand and explain.

May i ask you to sketch the intended solution for @RedDocMD's problem and/or for your RetainCountChecker problem, i.e. how it'll look in the checker code? Or in any imaginary checker of your choice? Like, what's the motivating example? That could help me a lot with understanding your approach.

The existing pattern doesn't work, though. I honestly don't like how this tracking (as a fundamental process) is organized at all. It doesn't really matter what predicate or callback we introduce, it is still vague. It should be way more specific, like "I am tracking this value". Tracking involves a hell-lot of other things that we get "for free", but don't really understand them.

Interesting, can you give an example (say, in case of @RedDocMD's problem) where isTracked() would give an incorrect result but "I am tracking this value" would give a correct result? I guess such situations will arise more often as we add more tracking, but still.

I think it's a good time to step back and talk about statefulness of this whole thing.

Path sensitive checkers are stateless; their state is instead encoded in the ProgramState structure. This is essential because execution paths are branchy and nodes are modeled in a fairly arbitrary order.

Bug visitors are stateful. This is fine because every bug path is linear and traversed in order. Value tracks are linear too - whelp, most of the time, I suppose, but we don't mind basically forking our entire stateful machine whenever this is violated.

Note tags, on the other hand, are stateless again by design. This is why they are designed to rely on information in the bug report, such as interestingness, for communication across tags. Interestingness is an older concept that is simply re-used in note tags but from the perspective of note tags interestingness is a very simple thing: it's their state.

Information attached to the bug report is arbitrary. We can extend it as much as we want, make a data structure equivalent of the GenericDataMap that would track checker-specific variations over the concept of interestingness. It could decorate the specific role of every value as much as necessary. If we need to specify that this value is the same value that's seen by that other note tag (or participated in the emission of the warning itself in a certain manner), we can totally do that.

Hence the question, do we really need to preserve this whole imperative concept of "I" in "I am tracking this value", or is the functional concept of "This value is playing a certain role" entirely sufficient for our purposes?

In D103605#2801681, @NoQ wrote:

Inlined defensive check suppressions are indeed pretty ugly; this problem should ideally be solved in a completely different manner.

That said, I strongly disagree that "when the tracker stops" is a hard-to-understand concept. It's an extremely natural and intuitive question to ask:

"Where does the value come from?"

It's fairly reasonable to wonder what happened during analysis that caused us to produce this value for the first time, which then traveled down to our bug point, and under what circumstances did it happen. Not much precise reasoning can be based on this information but it's a very good piece of information to drive imprecise reasoning and heuristics, such as (but not limited to) inlined defensive check suppressions (where it bears a relatively precise meaning that corresponds exactly to "state splits in nested stack frames are intrinsically less reliable as they can't ever be derived from presumption of no dead code").

Another example: a value of a top-level parameter variable is much more likely to be arbitrary than a return value of an unknown function call (if the user wanted pre-conditions they could have added an assert), which in turn is much more likely to be arbitrary than a value loaded from a heap region after store invalidation from unknown function call (we're 99% sure that the function doesn't touch this particular heap region so it's probably just the old value). In this sense, I'm pretty tempted to suppress bug paths where two branch conditions (ultimately going into opposite directions) are based on different values of the same heap variable after different number of invalidations. But I'm certainly against doing so for two different top-level parameter values. You may say that in these cases the information is most likely available from the structure of the symbol (SymbolRegionValue vs. SymbolConjured vs. SymbolDerived) but it only confirms my point: origins of values are so important that we even carry them as part of the value's identity! Direct access to the structure of the symbol for determining origins is also used a few times (eg., see call sites of getOriginRegion()).

OK, let's make a pause right here. We again start to go into other topics. You come from a pure perspective that track explains how the value got here and does only that. In this picture of the world, track has one starting point and one ending point. If it ends to early (due to the lack of domain knowledge), the checker can pick up and restart the tracking if needed. If it ends for real, this is the earliest where we got this value from and it can tell us about the warning and situation we are in. The problem is that tracking is fundamentally dirty the way it is. It is not linear, and when we track one simple value, it actually spawns a gigantic tree of traces. Thus when we get to the point where it all ended, it actually ended 100 times. It ended in multiple different visitors that were tracking all kinds of things not limited to the value itself. trackExpressionValue is hard also because of this reason. Inline defensive checks don't belong there and clutter it.

I DON'T KNOW how to simplify this ATM. Probably we can decouple ONE VALUE TRACKING from the rest of it, but it's not what I'm trying to do. I want to refactor the existing solution (with all of its inherent design flaws) into something more extensible. And I think that again ATM, with the current state of things, the concept of "tracking stopped here" is doomed.

What I'm trying to do here is to see a general direction in which we're moving. We have the benefit of being able to predict things. We don't have to be agile; our problem is well-defined from the start and will never really change. There's a bit of flexibility required to support a variety of different future checkers, yes, but that's about it. And then, again, it's also not entirely arbitrary: we have a pretty good amount of data on what our checkers typically do and it's unlikely to expand drastically in the near future. So we don't always need to go for the most flexible solution, but there's often benefit in going for a simpler solution that simply covers our predictable needs and nothing else.

As a historical example of such process, currently bug visitors perform a single pass over the bug report, from bottom to top. This wasn't always this way: originally visitors traversed the bug report multiple times (occasionally 30+ times), adding new visitors along the way, until the set of visitors attached to the bug report reached a fixed point. The original solution was very flexible; it allowed us to backreference values at the bottom in a flash of hindsight and have all visitors restarted gracefully. But in the entirety of our foreseeable circumstances this wasn't necessary because the ultimate purpose of having visitors was to answer the question "why do we think this situation has indeed occured in the program?" by adding notes to the path, and the answer to that question is always above the current point in the path. Such simplification really helped.

This is why i'm trying to understand the entire thing before opting into a system. Especially when it comes to checker APIs for which overengineering is particularly dangerous - and I'm still not sure how exactly the new checker APIs are intended to work after a few pages of discussion. I don't care how they work as much as I care about knowing how they are ultimately intended to work and whether it's simple enough to understand and explain.

It's not that it came from out of nowhere. That I just decided that trackExpressionValue is not extensible enough and put this all together. We saw with our own eyes that it didn't work for RetainCountChecker or SmartPtrChecker and started to come up with ridiculous workarounds to address it. I also see all the different pieces of other checkers that live inside of BugReportersVisitors.cpp. And they are there not because of checker developers failing to understand checker APIs, but rather tracking APIs failing them.

May i ask you to sketch the intended solution for @RedDocMD's problem and/or for your RetainCountChecker problem, i.e. how it'll look in the checker code? Or in any imaginary checker of your choice? Like, what's the motivating example? That could help me a lot with understanding your approach.

It is a bit weird to me after you correctly named all the properties of the proposed solution and described how it will look like in the new architecture, but OK.

Right now, if we go up the chain of tracking calls, we get to the point where tracker gets stuck or stops.
Instead of understanding that it stops, and modifying a hell-lot of code, and passing callbacks literally everywhere, I suggest we don't let it stop in the first place.
Let's say we have this example:

...
d = foo(e);
c = d;
Type b = c;
a = b;
reportPoint(a);

Here foo is either dyn_cast for RetainCountChecker or e.get() in SmartPtrChecker.
FindLastBRVisitor gets to d = foo(e) understands that this is assignment (maybe even can say that it gets the value stored into e), calls trackExpressionValue for foo(e) and it does nothing.

It is pretty high up the chain, so we probably don't want to repeat the logic of getting up there (@RedDocMD attempted that), but we couldn't insert our logic into this intimate dance of trackExpressionValue and FindLastBRVisitor.

With the new solution, if you want the most simple ways of addressing it, you create a handler:

class FooHandler : public ExpressionHandler {
public:
  using ExpressionHandler::ExpressionHandler;

  Tracker::Result handle(const Expr *E, const ExplodedNode *InputNode,
                         const ExplodedNode *ExprNode,
                         TrackingOptions Opts) {
    if (auto *Arg = getFooArgument(E)) {
      return getParentTracker().track(E, ExprNode, Opts);
    }
    return {};
};

Essentially these handlers is my way of first introducing new logic for these two checkers, and second to extract similar logic that exists there from all other checkers.

Then, let's say, I decided that I don't like the note "d is assigned here". It doesn't seem very specific and adding more notes on top will overwhelm our poor user even more. So, I decided to make notes: "d is foo'ed from e", "g is initialized to foo from e", and "p's foo is used as the 1st argument".
Right now, there is literally no way to do this except for adding this directly to already bloated FindLastStoreBRVisitor. Why do you think it already produces notes about "garbage values" and "null/nil pointers"?

So, the checker implements a StoreHandler where they handle all the cases. It is important to have it exactly like this because FindLastStoreBRVisitor has non-trivial logic of catching "stores" and we don't want this logic to get copied into different checkers.

And if we are talking where it can go further, this particular approach, is defining more "events" in the tracking process and decoupling "event identification" from the process of generating notes for such events.

Interesting, can you give an example (say, in case of @RedDocMD's problem) where isTracked() would give an incorrect result but "I am tracking this value" would give a correct result? I guess such situations will arise more often as we add more tracking, but still.

I'm not super sure, but I think smth like this will suffice:

// a and b are completely unrelated smart pointers
raw = a.get();
if (!b.get() && !a.get()) {
  use(*raw);
}

We can attempt to track b.get() because of the conditions. Checker wouldn't get that we are actually interested in conditions and will initiate a full-blown back-trace for b even though we don't need it.

I think it's a good time to step back and talk about statefulness of this whole thing.

Path sensitive checkers are stateless; their state is instead encoded in the ProgramState structure. This is essential because execution paths are branchy and nodes are modeled in a fairly arbitrary order.

Bug visitors are stateful. This is fine because every bug path is linear and traversed in order. Value tracks are linear too - whelp, most of the time, I suppose, but we don't mind basically forking our entire stateful machine whenever this is violated.

Note tags, on the other hand, are stateless again by design. This is why they are designed to rely on information in the bug report, such as interestingness, for communication across tags. Interestingness is an older concept that is simply re-used in note tags but from the perspective of note tags interestingness is a very simple thing: it's their state.

Information attached to the bug report is arbitrary. We can extend it as much as we want, make a data structure equivalent of the GenericDataMap that would track checker-specific variations over the concept of interestingness. It could decorate the specific role of every value as much as necessary. If we need to specify that this value is the same value that's seen by that other note tag (or participated in the emission of the warning itself in a certain manner), we can totally do that.

Hence the question, do we really need to preserve this whole imperative concept of "I" in "I am tracking this value", or is the functional concept of "This value is playing a certain role" entirely sufficient for our purposes?

I'm glad that you brought this up and don't understand why you don't see it, really. Visitors are stateful. Hmm, I don't like it because it can be confusing because of program state and so on. So, let's say "Visitors are mutable". They can actually change, carry one piece of data. Tracker/handler mechanism is the same, you are allowed to carry some mutable data through the process. Maybe it is bad, but I don't see the way to get isTracked to solve all the problems that I want to solve (it is probably trying to solve bigger problems). Can we move inline defensive checks out of the BugReportVisitors.cpp with isTracked only?

And yes, checkers should be able to say "I track this value". It is the operation that makes sense only if it is helpful for the checker, and the checker should be able to customize it however the checker pleases.

clang/include/clang/StaticAnalyzer/Core/BugReporter/BugReporterVisitors.h
147	Gotcha! I'll add it!

In D103605#2802240, @vsavchenko wrote:

In D103605#2801681, @NoQ wrote:

OK, let's make a pause right here. We again start to go into other topics.

These are not other topics. We're discussing the overall direction into which this patchset is a large step. I absolutely welcome this step and am fascinated to see how it turns out but I want us to agree on the overall direction first.

You come from a pure perspective that track explains how the value got here and does only that. In this picture of the world, track has one starting point and one ending point. If it ends to early (due to the lack of domain knowledge), the checker can pick up and restart the tracking if needed. If it ends for real, this is the earliest where we got this value from and it can tell us about the warning and situation we are in. The problem is that tracking is fundamentally dirty the way it is. It is not linear, and when we track one simple value, it actually spawns a gigantic tree of traces. Thus when we get to the point where it all ended, it actually ended 100 times. It ended in multiple different visitors that were tracking all kinds of things not limited to the value itself. trackExpressionValue is hard also because of this reason. Inline defensive checks don't belong there and clutter it.

I DON'T KNOW how to simplify this ATM. Probably we can decouple ONE VALUE TRACKING from the rest of it, but it's not what I'm trying to do. I want to refactor the existing solution (with all of its inherent design flaws) into something more extensible. And I think that again ATM, with the current state of things, the concept of "tracking stopped here" is doomed.

The reason why I come from this perspective is because that's how I think it is supposed to work and I'm not aware of a single good reason why it can't. I think the reason why it's this way is that the original authors seem to have bombarded the code with extra clauses until it starts demonstrating the more or less expected behavior, without trying to solve the problem on paper first. Like with visitor fixpoint iteration, it sounds like an unnecessary piece of complexity that we will be ultimately able to eliminate.

Except, well, the distant constraints example a-la @RedDocMD's problem but for raw pointers. This is indeed an exceptional case which may cause two tracking efforts to run in parallel. This can be probably generalized to any constraint-like traits: if we want to explain a constraint, we'll have to be prepared that the constraint isn't part of the value track that we already have. This is the first time the above assumption was challenged and I definitely need time to process this but I still think there shouldn't be any other situations like this and there should be something special about this situation that will allow us to incorporate it into the linear-track solution.

"At the moment such refactoring is difficult to achieve" is a perfectly satisfying answer to me with respect to the current state of things. But I want us to agree as early as possible about whether we want to achieve this. Simply so that as we make every step, we keep this in mind and evaluate every step as taking us either closer to this goal or farther from this goal and try to avoid the latter. And I don't think that anything prevents us from discussing this goal right now; the problem we're trying to solve isn't really ever going to change and we already have a fairly large amount of data.

With the new solution, if you want the most simple ways of addressing it, you create a handler:
class FooHandler : public ExpressionHandler {
public:
  using ExpressionHandler::ExpressionHandler;

  Tracker::Result handle(const Expr *E, const ExplodedNode *InputNode,
                         const ExplodedNode *ExprNode,
                         TrackingOptions Opts) {
    if (auto *Arg = getFooArgument(E)) {
      return getParentTracker().track(E, ExprNode, Opts);
    }
    return {};
};
Essentially these handlers is my way of first introducing new logic for these two checkers, and second to extract similar logic that exists there from all other checkers.

Ok, do I understand correctly that:

Functionality-wise, this is entirely equivalent to my NoteTag solution with "isTracked()";
Boilerplate-wise it makes the isTracked() check implicit ("we were invoked at all therefore the value of the expression is tracked by... somebody") at the cost of asking the user to define a class;
Convenience-wise, it potentially requires the developer to reverse-engineer the symbolic execution step when the logic gets more complicated;
Flexibility-wise... I guess you can make custom trackers now, which will track the value in a different manner, and communicate with the tracker in order to produce different notes?

Then, let's say, I decided that I don't like the note "d is assigned here". It doesn't seem very specific and adding more notes on top will overwhelm our poor user even more. So, I decided to make notes: "d is foo'ed from e", "g is initialized to foo from e", and "p's foo is used as the 1st argument".
Right now, there is literally no way to do this except for adding this directly to already bloated FindLastStoreBRVisitor. Why do you think it already produces notes about "garbage values" and "null/nil pointers"?

So, the checker implements a StoreHandler where they handle all the cases. It is important to have it exactly like this because FindLastStoreBRVisitor has non-trivial logic of catching "stores" and we don't want this logic to get copied into different checkers.

And if we are talking where it can go further, this particular approach, is defining more "events" in the tracking process and decoupling "event identification" from the process of generating notes for such events.

Yeah, the store handler part seems perfectly legit because *I* don't see any better approaches here :) It's indeed a big problem that we can't customize existing notes and this sounds like the right data structure to solve the problem.

Interesting, can you give an example (say, in case of @RedDocMD's problem) where isTracked() would give an incorrect result but "I am tracking this value" would give a correct result? I guess such situations will arise more often as we add more tracking, but still.

I'm not super sure, but I think smth like this will suffice:
// a and b are completely unrelated smart pointers
raw = a.get();
if (!b.get() && !a.get()) {
  use(*raw);
}
We can attempt to track b.get() because of the conditions. Checker wouldn't get that we are actually interested in conditions and will initiate a full-blown back-trace for b even though we don't need it.
(...)
And yes, checkers should be able to say "I track this value". It is the operation that makes sense only if it is helpful for the checker, and the checker should be able to customize it however the checker pleases.

Ok, I don't understand! If we have anyway tracked the value back to b.get(), why not track further through b? What makes b.get() the exact place at which we want to stop tracking the value? If we wanted to know why we choose the true path at the branch, shouldn't we track further through b as well in order to answer that question? If we didn't want to know why we choose the true path, why did we track the value back to b.get() in the first place?

I guess one possible difference between a and b here would be in messaging, i.e. how we refer to that value. For example, we could refer to a as "the pointer value" (implying that this is the null-dereferenced pointer value) and refer to b as "a pointer value" (implying that it's some other value, maybe the one that participates in condition later). But even then, this distinction is much more coarse-grained than "I track the value" vs. "Someone else tracks the value". It's more like "Buggy value" vs. "Condition value" which is a distinction we already "almost" make with TrackingKind. Even if we wanted to identify our values as "pointer #1", "pointer #2", etc., we could totally do with a sufficiently advanced definition of isTracked().

Can we move inline defensive checks out of the BugReportVisitors.cpp with isTracked() only?

My original thinking was along the lines of, eg.,:

void ExprEngine::processBranch(...) {
  Bldr.addTransition(..., getNoteTag([] (BugReport &BR) {
    if (BR.isTracked(Condition) && !N->getLocationContext().inTopFrame())
      BR.markInvalid();
  }));
}

And this is how, basically, the entirety of trackExpressionValue() could be decomposed into individual stateless note tags, with all the necessary state stuffed into the bug report. Including the tracking process itself: if note tags are sufficient to help with tracking, they're sufficient to do the tracking from the start.

This code could also be moved into a checker's checkBranchCondition() so that to make it truly checker-specific. That would come at the cost of creating a lot of redundant exploded nodes just for the sake of attaching the tag that will probably never be used so this is rather bad, wouldn't recommend.

I still think the solution that relies on well-defined end of tracking is significantly more elegant. Which, again, could be implemented with note tags: we could have a note tag that'd call a callback stuffed directly into the bug report.

So I think it can work both ways given that the tags solution would rely on having a sufficiently flexible mutable state as part of the BugReport object.

In D103605#2804421, @NoQ wrote:

These are not other topics. We're discussing the overall direction into which this patchset is a large step. I absolutely welcome this step and am fascinated to see how it turns out but I want us to agree on the overall direction first.

OK, fair point, but for further plan, we need to have an incremental model of how we want to achieve it.

The reason why I come from this perspective is because that's how I think it is supposed to work and I'm not aware of a single good reason why it can't. I think the reason why it's this way is that the original authors seem to have bombarded the code with extra clauses until it starts demonstrating the more or less expected behavior, without trying to solve the problem on paper first. Like with visitor fixpoint iteration, it sounds like an unnecessary piece of complexity that we will be ultimately able to eliminate.

Right now it is all very tightly coupled. It is a lot of moving pieces that I don't think anyone fully understands. If we take time and decompose existing tracking into handlers, it can be a step in getting to know some implicit connections between different tracking activities we have right now.

Except, well, the distant constraints example a-la @RedDocMD's problem but for raw pointers. This is indeed an exceptional case which may cause two tracking efforts to run in parallel. This can be probably generalized to any constraint-like traits: if we want to explain a constraint, we'll have to be prepared that the constraint isn't part of the value track that we already have. This is the first time the above assumption was challenged and I definitely need time to process this but I still think there shouldn't be any other situations like this and there should be something special about this situation that will allow us to incorporate it into the linear-track solution.

"At the moment such refactoring is difficult to achieve" is a perfectly satisfying answer to me with respect to the current state of things. But I want us to agree as early as possible about whether we want to achieve this. Simply so that as we make every step, we keep this in mind and evaluate every step as taking us either closer to this goal or farther from this goal and try to avoid the latter. And I don't think that anything prevents us from discussing this goal right now; the problem we're trying to solve isn't really ever going to change and we already have a fairly large amount of data.

I get it, but I think we got stuck and should put goal first and think about what can be done to achieve this. Is it bad that tracking exists and the analyzer reverse engineers what it did itself? - No doubt! Is note mechanism a good alternative? - Sure! Is it enough right now? - No. Let's think how we can get there with incremental changes.

Ok, do I understand correctly that:

Functionality-wise, this is entirely equivalent to my NoteTag solution with "isTracked()";

Not really, you also get the context of tracking: two nodes and options (so you can figure out if it's being tracked thoroughly or as a condition).

Boilerplate-wise it makes the isTracked() check implicit ("we were invoked at all therefore the value of the expression is tracked by... somebody") at the cost of asking the user to define a class;

I suppose we can put it this way.

Convenience-wise, it potentially requires the developer to reverse-engineer the symbolic execution step when the logic gets more complicated;

I don't see ANY difference with the situation where we use isTracked. If you talk about notes, you can access them here as well. If you don't, then both ways require the same.

Flexibility-wise... I guess you can make custom trackers now, which will track the value in a different manner, and communicate with the tracker in order to produce different notes?

Exactly, tracker object is the thing that is guaranteed to stay whenever you go with tracking, and since you can customize it, you can affect every part of the tracking process.

Yeah, the store handler part seems perfectly legit because *I* don't see any better approaches here :) It's indeed a big problem that we can't customize existing notes and this sounds like the right data structure to solve the problem.

And we do need to have a central customizable object (aka Tracker) that we pass around the tracking process to make it happen, right?

Ok, I don't understand! If we have anyway tracked the value back to b.get(), why not track further through b? What makes b.get() the exact place at which we want to stop tracking the value? If we wanted to know why we choose the true path at the branch, shouldn't we track further through b as well in order to answer that question? If we didn't want to know why we choose the true path, why did we track the value back to b.get() in the first place?

I guess one possible difference between a and b here would be in messaging, i.e. how we refer to that value.

Bingo!

For example, we could refer to a as "the pointer value" (implying that this is the null-dereferenced pointer value) and refer to b as "a pointer value" (implying that it's some other value, maybe the one that participates in condition later). But even then, this distinction is much more coarse-grained than "I track the value" vs. "Someone else tracks the value". It's more like "Buggy value" vs. "Condition value" which is a distinction we already "almost" make with TrackingKind. Even if we wanted to identify our values as "pointer #1", "pointer #2", etc., we could totally do with a sufficiently advanced definition of isTracked().

Correct!

Can we move inline defensive checks out of the BugReportVisitors.cpp with isTracked() only?

My original thinking was along the lines of, eg.,:
void ExprEngine::processBranch(...) {
  Bldr.addTransition(..., getNoteTag([] (BugReport &BR) {
    if (BR.isTracked(Condition) && !N->getLocationContext().inTopFrame())
      BR.markInvalid();
  }));
}
And this is how, basically, the entirety of trackExpressionValue() could be decomposed into individual stateless note tags, with all the necessary state stuffed into the bug report. Including the tracking process itself: if note tags are sufficient to help with tracking, they're sufficient to do the tracking from the start.

Maybe, or maybe not. Maybe it takes more context than we think. You essentially assume that the only thing that we need to reimplement is one boolean variable isTracked and the expression that is being tracked. I doubt it.
And it is harder to debug polling approach as opposed to the callback approach.

This code could also be moved into a checker's checkBranchCondition() so that to make it truly checker-specific. That would come at the cost of creating a lot of redundant exploded nodes just for the sake of attaching the tag that will probably never be used so this is rather bad, wouldn't recommend.

Why not model defensive checks in the checker, so we don't make undesired assumptions in the first place?

I still think the solution that relies on well-defined end of tracking is significantly more elegant. Which, again, could be implemented with note tags: we could have a note tag that'd call a callback stuffed directly into the bug report.

I'm not sure I understood the last thing, can you please elaborate?

So I think it can work both ways given that the tags solution would rely on having a sufficiently flexible mutable state as part of the BugReport object.

OK, let me re-iterate to get some things straight.
Refactoring of trackExpressionValue is good, but we need to think further (and even further). You have a dream of getting rid of visitors and tracking altogether so that users don't need to reverse engineer their own checkers (preferably using tags). As a matter of transformation, you see that tracking should provide two new concepts: isTracking predicate (similar to the existing interesting-ness mechanism) and the way to detect where the tracking stopped. You also don't want tracking to be a gigantic tree and have everything, but to have one source of value without anything else. All other parts should be decoupled and maybe even given a different name. We still want to remove all different parts of other checkers from BugReporterVisitors.cpp and probably can do it using just one abstraction of isTracking to do that.

And here is my take on things (it is a bit chaotic, sorry about that):

If we want to move to isTracking, we still need a central component that participates in the tracking or something that is shared through all tracking logic. In my patches it is a Tracker object. It can be done differently by passing such object directly into every trackExpressionValue and its sub-visitors, but probably something more extensible would be a preference here.
If we want to split trackExpressionValue and extract checker-specific parts, we need a good and reliable way of splitting it. I used a time-tested approach of function extraction: if we have foo(A a, B b, C c) { ... } with N independent parts, we can refactor them into foo(A a, B b, C c) { foo1(a, b, c); foo2(a, b, c); ... fooN(a, b, c); }. If one of the fooK is big, we can repeat the same procedure. Instead of doing it plainly, I decided that the user might be interested in putting their own fooK into the mix.
Can we implement isTracked first and then split trackExpressionValue? My answer is no. It implies bigger refactoring because isTracking doesn't provide all the bits of A a, B b, C c that the logic used to receive before. One example here is the SuppressInlineDefensiveChecksVisitor. When we add it, we give it a node where the tracking of the expression was initiated, not the node with expression itself. We also have a very specific comment there that this is a requirement and expression own node might be "too late". If you use isTracked in the Visitor you will be automatically in the "too late" situation. And it is only one example. So, I insist that isTracked couldn't be the initial step for refactoring.
Can we implement "tracking has stopped mechanism" first and then split trackExpressionValue? My answer is no. Tracking right now is a tree and it is very convoluted. There is no one single point where it stops, it stops 100 of times. The concept of "stopped" in this situation is not useful (I'm still not convinced that it's useful at all, but this is a part of a different conversation). In order to fix that we need to split trackExpressionValue. Contradiction.
Can we implement isTracked afterwards? Yes, we can. Implementation of isTracked on it's own is fairly simple. And we can have users of isTracked and ExpressionHandler at the same time. Maybe, once we migrate all uses, ExpressionHandler can be safely retired.
Can we implement "tracking has stopped mechanism". Probably. This is something that I doubt in usefulness, and don't really want to plan.
Is there a way to get rid of tracking? Honestly, I don't think so. I do believe, though, that it is possible to stop reverse engineering ourselves. While the analyzer explores different paths, both the core and checkers somehow affect the state. We can cal such a change of the state as "event". Events can be both relevant and irrelevant (e.g. happened to a different smart pointer) to the issue that is found. Many of these events are pivotal to the user's understanding of the issue and we should explain them with notes. The way to understand that it is relevant is through tracking. We might want to keep these traces as we go, but it might be some extra price that we don't want to pay.
Let's get back to the events! Can we build our entire solution around them? I do think so, actually. I think that the analyzer can produce a bunch of events (maybe implemented as note tags). Then each bug report can provide a visitor, not exploded graph visitor as it is now, but event visitor. It will be automatically called for relevant events only (with tracking hidden under the hood). Many events will have reasonable defaults so you don't need to write your own implementations every time (e.g. "a is assigned with b here"), but you can always override them. This is something that I started with StoreHandler because "store" is one of those events. It is different though from the perfect vision because we don't want to reverse engineer what the engine/the checker did before, perfectly we would have this information in the node (or on the edge) that "store event happened". Tracker will be very trivial in this plan.
Is it possible to be done incrementally? I do think so. We can extract different events from trackExpressionValue and change them into various EventHandlers (like StoreHandler). Probably, in the process, we will understand what kind of data describes each type of events. Another step is pretty hard, we need to design how we are going to generate (easier) and store (harder) events during analysis. It should be pretty implicit, so it has a lot of sensible defaults produced by the core. And checker developers should be able to generate their own events (maybe even for other checkers, if they model something that everyone can benefit from). I think it will be possible to generate one event type at a time, and change how we handle it in the tracker, i.e. move from reverse engineering to detecting relevant events in the graph. Once we are finished with this, we can extract a separate thing called EventVisitor and migrate a big chunk of Tracker and EventHandler interfaces there.
Why do we even need to introduce tracker and event handlers if we are going to hide/delete them in this plan? The trick is about making this whole plan manageable. One enormous refactoring is hard to plan and test. Big work can demotivate and drain if you don't see the end of it. I believe that we can make great things if we take a lot of baby steps.

Phew, I hope this answers it.

... Implementation of isTracked on it's own is fairly simple. And we can have users of isTracked and ExpressionHandler at the same time. Maybe, once we migrate all uses, ExpressionHandler can be safely retired.

I think this sounds like a plan. More specifically, if we could maintain a map of all currently tracked expressions inside the PathSensitiveBugReport object, dynamically updated as tracking proceeds (expressions added as they're tracked and removed as soon as they're no longer necessary), with values in the map being arbitrary auxiliary information we want to pass around (eg., an explanation of how to refer to the value in notes), then all tracking machineries, regardless of how they're implemented, could communicate to each other easily through that map.

I still think the solution that relies on well-defined end of tracking is significantly more elegant. Which, again, could be implemented with note tags: we could have a note tag that'd call a callback stuffed directly into the bug report.

I'm not sure I understood the last thing, can you please elaborate?

I was thinking of something like:

void NullDereferenceChecker::checkLocation(...) {
  trackExpressionValue(
      BR, Ex, /*completionHandler=*/ [](const ExplodedNode *N) {
        if (N->getLocationContext()->inTopFrame())
          BR.markInvalid();
      }); // stashes completionHandler into the aforementioned tracking map
          // inside BR for Ex.
}

void ExprEngine::processBranch(...) {
  Bldr.addTransition(..., getNoteTag([](BugReport &BR) {
    if (BR.isTracked(Condition)) {
      // We've tracked a null pointer back to the assumption.
      // Our job here is done. Look up the callback in the tracking map.
      if (auto completionHandler = BR.getTrackingCompletionHandler(Condition))
        (*completionHandler)();
    }
  }));
}

Why do we even need to introduce tracker and event handlers if we are going to hide/delete them in this plan? The trick is about making this whole plan manageable. One enormous refactoring is hard to plan and test. Big work can demotivate and drain if you don't see the end of it. I believe that we can make great things if we take a lot of baby steps.

I completely agree. Incremental development ftw! I guess I'll keep an eye on whether we could convert some of your newly extracted handlers to note tags immediately (but most will probably require the expression tracking map to be implemented first).

This code could also be moved into a checker's checkBranchCondition() so that to make it truly checker-specific. That would come at the cost of creating a lot of redundant exploded nodes just for the sake of attaching the tag that will probably never be used so this is rather bad, wouldn't recommend.

Why not model defensive checks in the checker, so we don't make undesired assumptions in the first place?

That's pretty difficult. We can't enter (or skip) a branch without recording its respective condition. We can't continue analysis unless we enter (or skip) the branch. Once the branch has exited, we can't get rid of the constraint yet because we may encounter the same condition again and we're obliged to be consistent. Even when we've exited the nested stack frame we're still obliged to be consistent, unless we're also willing to completely drop the execution path inside the function and replace it with some form of summary. Or, you know, do summaries from the start.

Ok, I think I mostly understand where this is going, thanks!

I'll now dive into code.

NoQ added inline comments.Jun 9 2021, 9:47 PM

clang/include/clang/StaticAnalyzer/Core/BugReporter/BugReporterVisitors.h
160–161	Empty deques can be pretty large: https://en.cppreference.com/w/cpp/container/deque e.g. 8 times the object size on 64-bit libstdc++; 16 times the object size or 4096 bytes, whichever is larger, on 64-bit libc++ If we're attaching multiple trackers per non-deduplicated bug report I suspect this can easily amount to like 20-30 MB per process. I suspect that a list would be cheaper in every way, esp. given that we don't need random access.
210	That said, I'd rather not force the caller to perform a `getAs<>()`. It's pretty cumbersome and we can do the right thing (ignore it) in the callee anyway.

vsavchenko added inline comments.Jun 10 2021, 2:09 AM

clang/include/clang/StaticAnalyzer/Core/BugReporter/BugReporterVisitors.h
110	It is not really about suppressing it at all for all checkers. Such option already exists in analyzer options. This one is used specifically for situations when globally suppressions are on, but don't make sense for that particular type (like array index).
210	OK, I see, so we can call it with whatever value, but track only known values, correct?

Fix review remarks

Harbormaster completed remote builds in B108566: Diff 351098.Jun 10 2021, 4:03 AM

Ok this is amazing, no more comments!!

clang/include/clang/StaticAnalyzer/Core/BugReporter/BugReporterVisitors.h
210	Yup!

This revision is now accepted and ready to land.Jun 10 2021, 11:39 PM

This revision was landed with ongoing or failed builds.Jun 11 2021, 2:51 AM

Closed by commit rG0cc3100bf8d1: [analyzer] Introduce a new interface for tracking (authored by vsavchenko). · Explain Why

This revision was automatically updated to reflect the committed changes.

vsavchenko added a commit: rG0cc3100bf8d1: [analyzer] Introduce a new interface for tracking.

Sorry for the absence, took my time to catch up with this series! Really interesting insight into how a new interface is being designed! I need some time to digest things, but can't wait to help on the remaining patches.

In D103605#2816979, @Szelethus wrote:

Sorry for the absence, took my time to catch up with this series! Really interesting insight into how a new interface is being designed! I need some time to digest things, but can't wait to help on the remaining patches.

Hey, thanks! Here is one example of how it can be useful for checkers (D104136), maybe it can help to form a full picture.

Revision Contents

Path

Size

clang/

include/

clang/

StaticAnalyzer/

Core/

BugReporter/

BugReporterVisitors.h

219 lines

lib/

StaticAnalyzer/

Core/

BugReporterVisitors.cpp

48 lines

Diff 351377

clang/include/clang/StaticAnalyzer/Core/BugReporter/BugReporterVisitors.h

Show All 15 Lines

#include "clang/Analysis/ProgramPoint.h"		#include "clang/Analysis/ProgramPoint.h"
#include "clang/Basic/LLVM.h"		#include "clang/Basic/LLVM.h"
#include "clang/StaticAnalyzer/Core/PathSensitive/RangedConstraintManager.h"		#include "clang/StaticAnalyzer/Core/PathSensitive/RangedConstraintManager.h"
#include "clang/StaticAnalyzer/Core/PathSensitive/SVals.h"		#include "clang/StaticAnalyzer/Core/PathSensitive/SVals.h"
#include "llvm/ADT/FoldingSet.h"		#include "llvm/ADT/FoldingSet.h"
#include "llvm/ADT/STLExtras.h"		#include "llvm/ADT/STLExtras.h"
#include "llvm/ADT/StringRef.h"		#include "llvm/ADT/StringRef.h"
		#include <list>
#include <memory>		#include <memory>
		#include <utility>

namespace clang {		namespace clang {

class BinaryOperator;		class BinaryOperator;
class CFGBlock;		class CFGBlock;
class DeclRefExpr;		class DeclRefExpr;
class Expr;		class Expr;
class Stmt;		class Stmt;
▲ Show 20 Lines • Show All 61 Lines • ▼ Show 20 Lines	enum class TrackingKind {
Thorough,		Thorough,
/// Specifies that a more moderate tracking should be used for the expression		/// Specifies that a more moderate tracking should be used for the expression
/// value. This will essentially make sure that functions relevant to it		/// value. This will essentially make sure that functions relevant to it
/// aren't pruned, but otherwise relies on the user reading the code or		/// aren't pruned, but otherwise relies on the user reading the code or
/// following the arrows.		/// following the arrows.
Condition		Condition
};		};

		/// Defines a set of options altering tracking behavior.
		struct TrackingOptions {
		/// Specifies the kind of tracking.
		TrackingKind Kind = TrackingKind::Thorough;
		/// Specifies whether we should employ false positive suppression
		/// (inlined defensive checks, returned null).
		bool EnableNullFPSuppression = true;
		xazax.hunUnsubmitted Done Reply Inline Actions I vaguely remember we wanting to put this defensive check suppression in the related checkers. If that is the case, I wonder if this option belongs here. xazax.hun: I vaguely remember we wanting to put this defensive check suppression in the related checkers.
		vsavchenkoAuthorUnsubmitted Done Reply Inline Actions It is not really about suppressing it at all for all checkers. Such option already exists in analyzer options. This one is used specifically for situations when globally suppressions are on, but don't make sense for that particular type (like array index). vsavchenko: It is not really about suppressing it at all for all checkers. Such option already exists in…
		};

		/// Describes an event when the value got stored into a memory region.
		martongUnsubmitted Done Reply Inline Actions I think we should document this class' responsibility with a terse sentence. martong: I think we should document this class' responsibility with a terse sentence.
		vsavchenkoAuthorUnsubmitted Done Reply Inline Actions Sure vsavchenko: Sure
		///
		/// As opposed to checker checkBind API, it reacts also to binds
		/// generated by the checker as well. It can be useful when the binding
		/// happened as a result of evalCall, for example.
		struct StoreInfo {
		enum Kind {
		/// The value got stored into the region during initialization:
		/// int x = 42;
		Initialization,
		/// The value got stored into the region during assignment:
		/// int x;
		/// x = 42;
		Assignment,
		/// The value got stored into the region as block capture.
		xazax.hunUnsubmitted Not Done Reply Inline Actions There are some other places were we might have effects that are very similar to stores, like invalidations during conservative function evaluation. xazax.hun: There are some other places were we might have effects that are very similar to stores, like…
		NoQUnsubmitted Not Done Reply Inline Actions Yes, this would be really interesting. Like, we probably don't want to add notes saying "we invalidated the Store so here's why we're now assuming stuff that seemingly contradicts our previous assumptions". But we may want to add suppressions based on this info, eg. "our control flow dependencies are based on invalidation, maybe let's try to find a different path to the same bug?" NoQ: Yes, this would be really interesting. Like, we probably don't want to add notes saying "we…
		vsavchenkoAuthorUnsubmitted Done Reply Inline Actions Sure, nothing actually prevents us from extending this enum (and I actually do this, since I forgot call arguments here). However, at this point this models and tears apart `FindLastStoreBRVisitor`. The main idea here was that many checkers are benefiting from understanding how certain value got bound with a certain region. But it is so intertwined with `trackExpressionValue` that you couldn't do anything like this in your own code. And even simply the code that understands the node where the store happens is hard enough to make it abstract it away. vsavchenko: Sure, nothing actually prevents us from extending this enum (and I actually do this, since I…
		/// Block data is modeled as a separate region, thus whenever
		/// the analyzer sees a captured variable, its value is copied
		/// into a special block region.
		BlockCapture
		};

		/// The type of store operation.
		Kind StoreKind;
		/// The node where the store happened.
		const ExplodedNode *StoreSite;
		/// The expression where the value comes from.
		/// NOTE: might be null.
		Expr *SourceOfTheValue;
		/// Symbolic value that is being stored.
		SVal Value;
		/// Memory regions involved in the store operation.
		/// Dest <- Origin
		/// NOTE: Origin might be null, when the stored value doesn't come
		/// from another region.
		const MemRegion Dest, Origin;
		xazax.hunUnsubmitted Done Reply Inline Actions During path sensitive analysis we already have a callback for stores. We kind of replicating this logic for bug paths. So my questions are: Do we expect to have additional information here that were not available during the analysis earlier? Do we want to make this as similar to the forward analysis part as possible for developer familiarity? xazax.hun: During path sensitive analysis we already have a callback for stores. We kind of replicating…
		vsavchenkoAuthorUnsubmitted Done Reply Inline Actions Oof, I actually think that this one is trickier. If the checker modeled some operation (most likely `evalCall`), it can bind a value to the region by itself. And this handler triggers for such events as well because it simply finds the place where the switch was flipped and the region changed its binding. vsavchenko: Oof, I actually think that this one is trickier. If the checker modeled some operation (most…
		xazax.hunUnsubmitted Done Reply Inline Actions If the checker modeled some operation (most likely evalCall), it can bind a value to the region by itself. And this handler triggers for such events as well because it simply finds the place where the switch was flipped and the region changed its binding. This is such a great point, I'd like that to be captured as a comment somewhere. xazax.hun: > If the checker modeled some operation (most likely evalCall), it can bind a value to the…
		vsavchenkoAuthorUnsubmitted Done Reply Inline Actions Gotcha! I'll add it! vsavchenko: Gotcha! I'll add it!
		};

		class ExpressionHandler;
		class StoreHandler;

		/// A generalized component for tracking expressions, values, and stores.
		///
		/// Tracker aimes at providing a sensible set of default behaviors that can be
		/// used by any checker, while providing mechanisms to hook into any part of the
		/// tracking process and insert checker-specific logic.
		class Tracker {
		private:
		using ExpressionHandlerPtr = std::unique_ptr<ExpressionHandler>;
		using StoreHandlerPtr = std::unique_ptr<StoreHandler>;
		NoQUnsubmitted Not Done Reply Inline Actions Empty deques can be pretty large: https://en.cppreference.com/w/cpp/container/deque e.g. 8 times the object size on 64-bit libstdc++; 16 times the object size or 4096 bytes, whichever is larger, on 64-bit libc++ If we're attaching multiple trackers per non-deduplicated bug report I suspect this can easily amount to like 20-30 MB per process. I suspect that a list would be cheaper in every way, esp. given that we don't need random access. NoQ: Empty deques can be pretty large: > https://en.cppreference.com/w/cpp/container/deque > e.g. 8…

		PathSensitiveBugReport &Report;
		std::list<ExpressionHandlerPtr> ExpressionHandlers;
		std::list<StoreHandlerPtr> StoreHandlers;

		public:
		/// \param Report The bug report to which visitors should be attached.
		Tracker(PathSensitiveBugReport &Report);
		virtual ~Tracker() = default;

		PathSensitiveBugReport &getReport() { return Report; }

		/// Describes a tracking result with the most basic information of what was
		/// actually done (or not done).
		struct Result {
		/// Usually it means that the tracker added visitors.
		bool FoundSomethingToTrack = false;
		/// Signifies that the tracking was interrupted at some point.
		/// Usually this information is important only for sub-trackers.
		bool WasInterrupted = false;

		/// Combines the current result with the given result.
		void combineWith(const Result &Other) {
		// If we found something in one of the cases, we can
		// say we found something overall.
		FoundSomethingToTrack \|= Other.FoundSomethingToTrack;
		// The same goes to the interruption.
		WasInterrupted \|= Other.WasInterrupted;
		}
		};

		/// Track expression value back to its point of origin.
		///
		/// \param E The expression value which we are tracking
		/// \param N A node "downstream" from the evaluation of the statement.
		/// \param Opts Tracking options specifying how we want to track the value.
		virtual Result track(const Expr E, const ExplodedNode N,
		TrackingOptions Opts = {});

		/// Track how the value got stored into the given region and where it came
		/// from.
		///
		/// \param V We're searching for the store where \c R received this value.
		/// \param R The region we're tracking.
		/// \param Opts Tracking options specifying how we want to track the value.
		/// \param Origin Only adds notes when the last store happened in a
		/// different stackframe to this one. Disregarded if the tracking kind
		/// is thorough.
		/// This is useful, because for non-tracked regions, notes about
		xazax.hunUnsubmitted Not Done Reply Inline Actions Not directly related to this patch, but I wonder if we want to have unknown values with identities at some point, so we can track them. xazax.hun: Not directly related to this patch, but I wonder if we want to have unknown values with…
		NoQUnsubmitted Not Done Reply Inline Actions `UnknownVal` is a stop-gap for situations when we've screwed up so badly we don't even have types anymore. The only thing I'd ever want from them is to disappear :) I guess this could be useful for a debug checker that could explain where does an `UnknownVal` come from. In this case unknown values don't need identities; we can track other values without identities, such as null pointers, just fine. NoQ: `UnknownVal` is a stop-gap for situations when we've screwed up so badly we don't even have…
		vsavchenkoAuthorUnsubmitted Done Reply Inline Actions +1 for not caring about `UnknownVal`. vsavchenko: +1 for not caring about `UnknownVal`.
		NoQUnsubmitted Done Reply Inline Actions That said, I'd rather not force the caller to perform a `getAs<>()`. It's pretty cumbersome and we can do the right thing (ignore it) in the callee anyway. NoQ: That said, I'd rather not force the caller to perform a `getAs<>()`. It's pretty cumbersome and…
		vsavchenkoAuthorUnsubmitted Done Reply Inline Actions OK, I see, so we can call it with whatever value, but track only known values, correct? vsavchenko: OK, I see, so we can call it with whatever value, but track only known values, correct?
		NoQUnsubmitted Not Done Reply Inline Actions Yup! NoQ: Yup!
		/// changes to its value in a nested stackframe could be pruned, and
		/// this visitor can prevent that without polluting the bugpath too
		/// much.
		virtual Result track(SVal V, const MemRegion *R, TrackingOptions Opts = {},
		const StackFrameContext *Origin = nullptr);

		/// Handle the store operation and produce the note.
		///
		/// \param SI The information fully describing the store.
		/// \param Opts Tracking options specifying how we got to it.
		///
		/// NOTE: this method is designed for sub-trackers and visitors.
		virtual PathDiagnosticPieceRef handle(StoreInfo SI, TrackingOptions Opts);

		/// Add custom expression handler with the highest priority.
		///
		/// It means that it will be asked for handling first, and can prevent
		/// other handlers from running if decides to interrupt.
		void addHighPriorityHandler(ExpressionHandlerPtr SH) {
		ExpressionHandlers.push_front(std::move(SH));
		}
		martongUnsubmitted Not Done Reply Inline Actions This naming is a bit too bloated to me, what do you think about this: enum class Position { Front, Back }; addExpressionHandler(Front, ...) { martong: This naming is a bit too bloated to me, what do you think about this: ``` enum class Position {…
		vsavchenkoAuthorUnsubmitted Done Reply Inline Actions Maybe `addHighPriorityHandler` and `addLowPriorityHandler`? There is probably no reason to put `Expression` and `Store` into the actual name. vsavchenko: Maybe `addHighPriorityHandler` and `addLowPriorityHandler`? There is probably no reason to put…
		vsavchenkoAuthorUnsubmitted Done Reply Inline Actions What do you think about this solution? vsavchenko: What do you think about this solution?
		martongUnsubmitted Not Done Reply Inline Actions Sounds good. martong: Sounds good.

		/// Add custom expression handler with the lowest priority.
		///
		/// It means that it will be asked for handling last, and other handlers can
		/// prevent it from running if any of them decides to interrupt.
		void addLowPriorityHandler(ExpressionHandlerPtr SH) {
		ExpressionHandlers.push_back(std::move(SH));
		}

		/// Add custom store handler with the highest priority.
		///
		/// It means that it will be asked for handling first, and will prevent
		/// other handlers from running if it produces non-null note.
		void addHighPriorityHandler(StoreHandlerPtr SH) {
		StoreHandlers.push_front(std::move(SH));
		}

		/// Add custom store handler with the lowest priority.
		///
		/// It means that it will be asked for handling last, only
		/// if all other handlers failed to produce the note.
		void addLowPriorityHandler(StoreHandlerPtr SH) {
		StoreHandlers.push_back(std::move(SH));
		}

		/// Add custom expression/store handler with the highest priority
		///
		/// See other overloads for explanation.
		template <class HandlerType, class... Args>
		void addHighPriorityHandler(Args &&... ConstructorArgs) {
		addHighPriorityHandler(std::make_unique<HandlerType>(
		*this, std::forward<Args>(ConstructorArgs)...));
		}

		/// Add custom expression/store handler with the lowest priority
		///
		/// See other overloads for explanation.
		template <class HandlerType, class... Args>
		void addLowPriorityHandler(Args &&... ConstructorArgs) {
		addLowPriorityHandler(std::make_unique<HandlerType>(
		*this, std::forward<Args>(ConstructorArgs)...));
		}
		};

		/// Handles expressions during the tracking.
		class ExpressionHandler {
		private:
		Tracker &ParentTracker;

		public:
		ExpressionHandler(Tracker &ParentTracker) : ParentTracker(ParentTracker) {}
		virtual ~ExpressionHandler() {}

		/// Handle the given expression from the given node.
		///
		/// \param E The expression value which we are tracking
		/// \param Original A node "downstream" where the tracking started.
		/// \param ExprNode A node where the evaluation of \c E actually happens.
		/// \param Opts Tracking options specifying how we are tracking the value.
		virtual Tracker::Result handle(const Expr E, const ExplodedNode Original,
		const ExplodedNode *ExprNode,
		TrackingOptions Opts) = 0;

		/// \Return the tracker that initiated the process.
		Tracker &getParentTracker() { return ParentTracker; }
		};

		/// Handles stores during the tracking.
		class StoreHandler {
		private:
		Tracker &ParentTracker;

		public:
		StoreHandler(Tracker &ParentTracker) : ParentTracker(ParentTracker) {}
		virtual ~StoreHandler() {}

		/// Handle the given store and produce the node.
		///
		/// \param E The expression value which we are tracking
		/// \param N A node where the evaluation of \c E actually happens.
		/// \param Opts Tracking options specifying how we are tracking the value.
		///
		/// \return the produced note, null if the handler doesn't support this kind
		/// of stores.
		virtual PathDiagnosticPieceRef handle(StoreInfo SI, TrackingOptions Opts) = 0;

		Tracker &getParentTracker() { return ParentTracker; }
		};

/// Attempts to add visitors to track expression value back to its point of		/// Attempts to add visitors to track expression value back to its point of
/// origin.		/// origin.
///		///
/// \param N A node "downstream" from the evaluation of the statement.		/// \param N A node "downstream" from the evaluation of the statement.
/// \param E The expression value which we are tracking		/// \param E The expression value which we are tracking
/// \param R The bug report to which visitors should be attached.		/// \param R The bug report to which visitors should be attached.
/// \param EnableNullFPSuppression Whether we should employ false positive		/// \param EnableNullFPSuppression Whether we should employ false positive
/// suppression (inlined defensive checks, returned null).		/// suppression (inlined defensive checks, returned null).
▲ Show 20 Lines • Show All 298 Lines • Show Last 20 Lines

clang/lib/StaticAnalyzer/Core/BugReporterVisitors.cpp

Show First 20 Lines • Show All 59 Lines • ▼ Show 20 Lines
#include <cassert>		#include <cassert>
#include <deque>		#include <deque>
#include <memory>		#include <memory>
#include <string>		#include <string>
#include <utility>		#include <utility>

using namespace clang;		using namespace clang;
using namespace ento;		using namespace ento;
		using namespace bugreporter;

//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
// Utility functions.		// Utility functions.
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

static const Expr peelOffPointerArithmetic(const BinaryOperator B) {		static const Expr peelOffPointerArithmetic(const BinaryOperator B) {
if (B->isAdditiveOp() && B->getType()->isPointerType()) {		if (B->isAdditiveOp() && B->getType()->isPointerType()) {
if (B->getLHS()->getType()->isPointerType()) {		if (B->getLHS()->getType()->isPointerType()) {
▲ Show 20 Lines • Show All 1,964 Lines • ▼ Show 20 Lines	if (RHSV.isZeroConstant())
EnableNullFPSuppression);		EnableNullFPSuppression);
} else { // Track only the LHS of a division or a modulo.		} else { // Track only the LHS of a division or a modulo.
if (LHSV.isZeroConstant())		if (LHSV.isZeroConstant())
trackExpressionValue(InputNode, BO->getLHS(), report, TKind,		trackExpressionValue(InputNode, BO->getLHS(), report, TKind,
EnableNullFPSuppression);		EnableNullFPSuppression);
}		}
}		}

		//===----------------------------------------------------------------------===//
		// Tracker implementation
		//===----------------------------------------------------------------------===//

		Tracker::Tracker(PathSensitiveBugReport &Report) : Report(Report) {
		// TODO: split trackExpressionValue and FindLastStoreBRVisitor into handlers
		// and add them here.
		}

		Tracker::Result Tracker::track(const Expr E, const ExplodedNode N,
		TrackingOptions Opts) {
		if (!E \|\| !N)
		return {};

		const Expr *Inner = peelOffOuterExpr(E, N);
		const ExplodedNode *LVNode = findNodeForExpression(N, Inner);
		if (!LVNode)
		return {};

		Result CombinedResult;
		// Iterate through the handlers in the order according to their priorities.
		for (ExpressionHandlerPtr &Handler : ExpressionHandlers) {
		CombinedResult.combineWith(Handler->handle(Inner, N, LVNode, Opts));
		if (CombinedResult.WasInterrupted)
		break;
		}

		return CombinedResult;
		}

		Tracker::Result Tracker::track(SVal V, const MemRegion *R, TrackingOptions Opts,
		const StackFrameContext *Origin) {
		// TODO: support this operation after dismantling FindLastStoreBRVisitor
		return {};
		}

		PathDiagnosticPieceRef Tracker::handle(StoreInfo SI, TrackingOptions Opts) {
		// Iterate through the handlers in the order according to their priorities.
		for (StoreHandlerPtr &Handler : StoreHandlers) {
		if (PathDiagnosticPieceRef Result = Handler->handle(SI, Opts))
		// If the handler produced a non-null piece, return it.
		// There is no need in asking other handlers.
		return Result;
		}
		return {};
		}

bool bugreporter::trackExpressionValue(const ExplodedNode *InputNode,		bool bugreporter::trackExpressionValue(const ExplodedNode *InputNode,
const Expr *E,		const Expr *E,
PathSensitiveBugReport &report,		PathSensitiveBugReport &report,
bugreporter::TrackingKind TKind,		bugreporter::TrackingKind TKind,
bool EnableNullFPSuppression) {		bool EnableNullFPSuppression) {

if (!E \|\| !InputNode)		if (!E \|\| !InputNode)
return false;		return false;
▲ Show 20 Lines • Show All 982 Lines • Show Last 20 Lines

This is an archive of the discontinued LLVM Phabricator instance.

[analyzer] Introduce a new interface for trackingClosedPublic

Details

Diff Detail

Event Timeline

Revision Contents

Diff 351377

clang/include/clang/StaticAnalyzer/Core/BugReporter/BugReporterVisitors.h

clang/lib/StaticAnalyzer/Core/BugReporterVisitors.cpp

[analyzer] Introduce a new interface for tracking
ClosedPublic