This is an archive of the discontinued LLVM Phabricator instance.

[Analyzer] Widening loops which do not exit
ClosedPublic

Authored by seaneveson on Aug 26 2015, 3:59 AM.

Details

Summary

Dear All,

We have been looking at the following problem, where any code after the constant bound loop is not analyzed because of the limit on how many times the same block is visited, as described in bugzillas #7638 and #23438. This problem is of interest to us because we have identified significant bugs that the checkers are not locating. We have been discussing a solution involving ranges as a longer term project, but I would like to propose a patch to improve the current implementation.

Example issue:

for (int i = 0; i < 1000; ++i) {...something...}
int *p = 0;
*p = 0xDEADBEEF;

The proposal is to go through the first and last iterations of the loop. The patch creates an exploded node for the approximate last iteration of constant bound loops, before the max loop limit / block visit limit is reached. It does this by identifying the variable in the loop condition and finding the value which is “one away” from the loop being false. For example, if the condition is (x < 10), then an exploded node is created where the value of x is 9. Evaluating the loop body with x = 9 will then result in the analysis continuing after the loop, providing x is incremented.
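To illustrate with the example above (a hypothetical trace, assuming the default "max-loop" of 4): the loop body is analyzed with i = 0, 1 and 2 as before, and the patch adds one further pass with i = 999, the value one away from the condition becoming false:

for (int i = 0; i < 1000; ++i) {...something...} /* analyzed with i = 0, 1, 2, then widened with i = 999 */
int *p = 0;
*p = 0xDEADBEEF; /* now reachable, so the null dereference can be reported */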

The patch passes all the tests, with some modifications to coverage.c, in order to make the ‘function_which_gives_up’ continue to give up, since the changes allowed the analysis to progress past the loop.

This patch does introduce possible false positives, as a result of not knowing the state of variables which might be modified in the loop. I believe that, as a user, I would rather have false positives after loops than do no analysis at all. I understand this may not be the common opinion and am interested in hearing your views. There are also issues regarding break statements, which are not considered. A more advanced implementation of this approach might be able to consider other conditions in the loop, which would allow paths leading to breaks to be analyzed.

Lastly, I have performed a study on large code bases and I think there is little benefit in having “max-loop” default to 4 with the patch. For variable bound loops this tends to result in duplicated analysis after the loop, and it makes little difference to any constant bound loop which will do more than a few iterations. It might be beneficial to lower the default to 2, especially for the shallow analysis setting.

Please let me know your opinions on this approach to processing constant bound loops and the patch itself.

Regards,

Sean Eveson
SN Systems - Sony Computer Entertainment Group

Diff Detail

Repository
rL LLVM

Event Timeline

seaneveson updated this revision to Diff 33190.Aug 26 2015, 3:59 AM
seaneveson retitled this revision to [Analyzer] Handling constant bound loops.
seaneveson updated this object.
seaneveson added a subscriber: cfe-commits.
zaks.anna requested changes to this revision.Aug 26 2015, 6:29 PM
zaks.anna edited edge metadata.

I agree that the way the static analyzer handles loops with known bounds is quite bad and we lose coverage because of it, so this is definitely an important problem to solve!

I've briefly looked at the patch and the main problem is that while you reset the value of the counter variable, you do not change the values of the other variables. This will leave us in a state that is wrong and will likely lead to false positives and inconsistencies, avoiding which is extremely important.

A way this could be improved is by invalidating all the values that the loop affects, both the values on the stack and on the heap. (We could even start overly conservative and invalidate all the values in the state; setting the known values to unknown values.)

This revision now requires changes to proceed.Aug 26 2015, 6:29 PM
xazax.hun edited edge metadata.Aug 26 2015, 6:40 PM

A conservative solution should work.

But if we want to preserve some precision, I think it should be possible not to invalidate those values on the stack that are not affected by the loop. The hard problem here is that it is impossible to reason about the heap without global knowledge, so if there is a pointer on the heap to a local variable, that variable should be invalidated as well.
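For example (a hypothetical case illustrating the heap problem):

void f(int **heap_pp) {
  int local = 0;
  *heap_pp = &local;             /* a heap cell now points to a stack variable */
  for (int i = 0; i < 1000; ++i)
    **heap_pp = i;               /* the loop modifies 'local' through the heap */
  /* 'local' must be invalidated even though the loop never names it directly */
}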

What do you think?

krememek edited edge metadata.Aug 26 2015, 9:39 PM

A way this could be improved is by invalidating all the values that the loop affects, both the values on the stack and on the heap. (We could even start overly conservative and invalidate all the values in the state; setting the known values to unknown values.)

I worry that this would be far too conservative, and would introduce new false positives because of lost precision.

One unsound hack would be to analyze a loop once by unrolling, and then invalidate anything that was touched in the ProgramState during that loop iteration. That invalidation could also be transitive, e.g. any subregions of touched state, etc.

Another, more principled hack, would be to look at all DeclRefExprs within a loop and invalidate all memory in the cone-of-influence of those variables (i.e., values they point to, etc.), but that's it.

Then there is the problem of called functions within the loop, as they won't be analyzed. Those could interfere with the ability of a checker to do its job.

My recommendation is that we still unroll loops a couple of times, getting full precision, and then employ a widening technique like the ones being discussed so that the final analyzed iteration acts as the last one, but as a conservative over-approximation.

FWIW, I do think this is a great problem to work on. It is easy to come up with solutions that work for specific examples but fall over on general code. I completely agree that failing to analyze code after the loop is a major hole and a lost opportunity to find bugs, but fixing that should not come at the cost of a huge influx of false positives. Some basic invalidation of values touched by the loop, which includes possibly invalidating checker state, will likely be necessary. I think this is what Anna was getting to in her comment.

One thing about the tests passing: that's great, but that's obviously insufficient. We probably have fewer tests showing that the analyzer can't handle something than tests that show it does handle something.

When a patch like this lands, which expands the analysis scope of the core analyzer, a few things are worth measuring:

(1) The impacted analysis time on a typical project.

(2) The impacted memory usage of the analyzer on a typical project.

(3) Increased analysis coverage. Are we covering more statements in practice? Etc. Since that is the motivation of this patch, it would be good to benchmark it on some real code.

We should, as a community, start keeping track of such things for a few projects so we know whether or not experiments like these are objectively a real win.

This will leave us in a state that is wrong and will likely lead to false positives and inconsistencies, avoiding which is extremely important.

I accept that my current patch is not a comprehensive solution to the problem and that it may introduce false positives; however, I do think it is an improvement, where it is preferable to have false positives over doing no analysis after the loop.

My recommendation is that we still unroll loops a couple of times, getting full precision, and then employ a widening technique like the ones being discussed so that the final analyzed iteration acts as the last one, but as a conservative over-approximation.

In the patch, constant bound loops are unrolled with the “max-loop” setting (default of 4) as (i = 0, 1, 2, 99) rather than (i = 0, 1, 2, 3); as such, we analyze the same number of paths through the loop body.

In my experience, constant bound loops are normally used to make simple modifications to fixed length collections of data; I think the behaviour of the majority of these loops will be represented by the first and last iteration.

Another, more principled hack, would be to look at all DeclRefExprs within a loop and invalidate all memory in the cone-of-influence of those variables (i.e., values they point to, etc.), but that's it.

Could this be done easily with a visitor and the existing invalidate methods? In cases where the analyzer is unsure which values might be modified by the loop, it could fall back to invalidating all the values in the state.

Then there is the problem of called functions within the loop, as they won't be analyzed. Those could interfere with the ability of a checker to do its job.

My recommendation is that we still unroll loops a couple of times, getting full precision, and then employ a widening technique like the ones being discussed so that the final analyzed iteration acts as the last one, but as a conservative over-approximation.

Wouldn’t widening before the last iteration result in more paths being explored than additional unrolling would, i.e. as a result of values in conditions being unknown?

Regards,

Sean Eveson
SN Systems - Sony Computer Entertainment Group

I accept that my current patch is not a comprehensive solution to the problem and that it may introduce false positives; however, I do think it is an improvement, where it is preferable to have false positives over doing no analysis after the loop.

We try to avoid false positives as much as possible. They are very painful for users to deal with.

In my experience, constant bound loops are normally used to make simple modifications to fixed length collections of data; I think the behaviour of the majority of these loops will be represented by the first and last iteration.

The main issue with the patch is that it produces a false path on which the value of only one of the variables is reset to the last iteration of the loop, while the rest of them are set as if it were the 3rd iteration. A way to solve this is to compute what can be invalidated by the loop and set those values to unknown (a widening operation).
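A hypothetical example of such a false path (assuming the default max-loop of 4):

int i, x = 0;
for (i = 0; i < 10; ++i)
  x += 2;
/* The unrolled iterations analyze i = 0, 1, 2 (x becomes 2, 4, 6); the widened
   node then resets i to 9 but leaves x at 6, so after the loop the analyzer
   believes x == 8 instead of the real value 20. */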

You should develop this feature behind a flag. This would allow for incremental development and simplify evaluation.

I agree this should be under a flag, or more specifically, an -analyzer-config option.

I think the functionality started here by doing something clever with loops is complex enough to warrant putting this into a separate .cpp file. We can keep this here for now, but this seems like complexity that is going to naturally creep up and make this file even more monolithic than it already is.

This change also unconditionally never evaluates the body of a loop if the loop bound is constant. That seems like a major degradation in precision, as it is worth analyzing the loop body at least once to get full precision on the paths explored within it. That's why I suggest unrolling at least a couple of times. The first unrolling will get good coverage of the code within the loop, and the second unrolling will capture some correlated conditions across loop iterations that can diagnose some interesting behavior. After the second unrolling (or 'nth' for some 'n'), this widening seems applicable. I've also seen some static analyzers focused on buffer overflow detection unroll loops like this by simply extending 'n' to be the constant loop bound, opting for maximum precision at the cost of analyzing all those iterations (which isn't necessarily the best option either, but it is an option).

It seems the patch needs two things:

  1. A general scheme for widening, which can just be invalidation of anything touched within a loop (a rough sketch follows this list). I think that can be done by having an algorithm that does an AST walk and looks at all variables whose lvalues are taken (used to store values or otherwise get their address using '&') or anything passed-by-reference to a function. That set of VarDecls can be cached in a side table, mapping ForStmt*'s (other loops as well) to the set of VarDecls that are invalidated. Those could then be passed to something that does the invalidation using the current invalidation logic we have in place. The invalidation logic should also handle checker state, which also needs to get invalidated alongside variable values.
  2. We need to consult the number of times a loop has been executed to determine when to apply this widening. We can look at the basic block counts, which are already used to cull off too many loop iterations.
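A rough sketch of the collection step in point 1, assuming Clang's RecursiveASTVisitor. It is deliberately coarser than described, recording every referenced VarDecl rather than only those whose lvalues are taken, and the class and variable names are hypothetical:

#include "clang/AST/RecursiveASTVisitor.h"
#include "llvm/ADT/SmallPtrSet.h"

namespace {
// Collects the VarDecls referenced inside a loop so their regions can be
// handed to the existing invalidation logic (and cached per loop statement).
class LoopVarCollector
    : public clang::RecursiveASTVisitor<LoopVarCollector> {
public:
  llvm::SmallPtrSet<const clang::VarDecl *, 16> Vars;

  bool VisitDeclRefExpr(clang::DeclRefExpr *DRE) {
    if (const auto *VD = llvm::dyn_cast<clang::VarDecl>(DRE->getDecl()))
      Vars.insert(VD); // conservatively record every referenced variable
    return true;       // continue the traversal
  }
};
} // namespace

// Usage: LoopVarCollector C; C.TraverseStmt(TheForStmt); then invalidate
// the regions of the VarDecls in C.Vars.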

This patch cannot go in as is, as it will seriously degrade the quality of the analyzer without further refinements. It is a great start, and I would be supportive of the patch landing more-or-less as is if it is behind a flag, and not on by default until the improvements suggested are made.

Hi Sean,

I responded with some more feedback. Conceptually the patch is quite simple, but I think Anna’s points are all spot on. I think we’d all be quite amenable to this work going in under a flag, with further improvements going in on top of that. That way we can all collectively start hashing this out in trunk instead of feature creeping a patch.

Ted

  1. We need to consult the number of times a loop has been executed to determine when to apply this widening. We can look at the basic block counts, which are already used to cull off too many loop iterations.

On closer inspection, this is actually already in the patch:

BldCtx.blockCount() == AMgr.options.maxBlockVisitOnPath - 1

I do wonder if nested loops will cause a problem here, however. If a loop is nested, blockCount() could already be the same as maxBlockVisitOnPath() because of the outer loop. For example:

for (int i = 0; i < 3; i++) {
  for (int j = 0; j < 2; j++) { ... }
}

On the second iteration of the outer loop, the first iteration of the inner loop will already see a block count of 2 for the inner loop's body. The analyzer will prune paths regardless of this loop widening heuristic. I don't think the suggested patch actually helps with this case because the block count gets exhausted. We may need a way to enhance the accounting of block counts when taking loop heuristics into account.

We try to avoid false positives as much as possible. They are very painful for users to deal with.

You should develop this feature behind a flag. This would allow for incremental development and simplify evaluation.

I absolutely want to minimize false positives, but I also want to make a start on just doing something after the loop, where the solution could then be improved upon. Developing behind a flag seems to be an ideal solution to me.

I think the functionality started here by doing something clever with loops is complex enough to warrant putting this into a separate .cpp file.

Good idea.

I do wonder if nested loops will cause a problem here

Yes, the patch only widens the innermost loop. A fix for this should ideally prevent inner loops from being analyzed again from an identical state. I haven’t looked into whether this will ‘just happen’ as a result of the framework, or how to do it otherwise.

I think we’d all be quite amenable to this work going in under a flag, with further improvements going in on top of that. That way we can all collectively start hashing this out in trunk instead of feature creeping a patch.

If I refactor this patch in its current state into a separate file and put it behind a flag, would it then be acceptable? I would then like to take some time to look at the invalidation improvements as discussed in this thread.

I think I'd like to see two things, primarily:

  1. The logic being behind an -analyzer-config option.
  2. Very basic invalidation of values, as Anna suggests, to make this patch minimally viable.

For the latter, I think there is a simple solution. Notice that ProgramState has a method called invalidateRegions. This is a worklist algorithm that takes a set of memory regions and invalidates all memory regions (including checker state) that are touched by those regions, or reachable via memory bindings, etc. It's the complete transitive closure. To get the conservative invalidation that Anna suggests (actually, a little less conservative), I think you can just pass two MemRegions as the input to that method and get a new ProgramState with the appropriate regions invalidated. Those MemRegions are the ones corresponding to the current stack frame, specifically divided into locals and parameters. You can get them by getting the current StackFrameContext (which you get from the LocationContext) and using the following two methods from MemRegionManager:

getStackLocalsRegion(const StackFrameContext *STC);
getStackArgumentsRegion(const StackFrameContext *STC);

Those two MemRegions will be the parent regions of all VarRegions, etc. on the current stack frame. If you pass those to invalidateRegions, that will invalidate all memory on the current stack frame, and anything they reference.

Actually, it would probably also be good to pass in the MemRegion returned by getGlobalsRegion() as well, to invalidate all global memory. But just those three will give the conservative invalidation Anna suggested, and would give a viable initial version of the patch (which should still be guarded under a flag). All told, that's probably about 10 lines of code.

With that, I think you would have a viable initial patch, and something that would be meaningfully testable.

Thank you for working on this!

To get the conservative invalidation that Anna suggests (actually, a little less conservative), I think you can just pass two MemRegions as the input to that method and get a new ProgramState with the appropriate regions invalidated.

Thank you for the suggestion. Unfortunately nothing seems to get invalidated when I call invalidateRegions, in the following code:

const StackFrameContext *STC = LCtx->getCurrentStackFrame();
MemRegionManager &MRMgr = svalBuilder.getRegionManager();
const MemRegion *Regions[] = {
  MRMgr.getStackLocalsRegion(STC),    // locals on the current frame
  MRMgr.getStackArgumentsRegion(STC), // parameters on the current frame
  MRMgr.getGlobalsRegion()            // global memory
};
ProgramStateRef State;
State = PrevState->invalidateRegions(Regions, Cond, BlockCount, LCtx,
                                     /*CausedByPointerEscape*/ false);

Do you have any suggestions on what I have done wrong?

If there is no simple way to invalidate all three regions I will work on implementing something like the following:

A general scheme for widening, which can just be invalidation of anything touched within a loop. I think that can be done by having an algorithm that does an AST walk, and looks at all variables whose lvalues are taken (used to store values or otherwise get their address using '&') or anything passed-by-reference to a function. That set of VarDecls can be cached in a side table, mapping ForStmt*'s (other loops as well) to the set of VarDecls that are invalidated. Those could then be passed to something that does the invalidation using the current invalidation logic we have in place. The invalidation logic should also handle checker state, which also needs to get invalidated alongside variable values.

I suspect this has to do with invalidateRegions itself. I will take a look. In the meantime, can you provide an updated patch that I can try out so I can step through the algorithm (if necessary) in the debugger and reproduce what you are seeing?

seaneveson updated this revision to Diff 33906.Sep 3 2015, 12:59 AM
seaneveson edited edge metadata.

Refactored into a new file: LoopWidening.cpp (and LoopWidening.h).
Added an analyzer-config option, which defaults to false: widen-constant-bound-loops=false
Modified analyzer-config tests to check for the new option.
Added code to invalidate the stack locals, stack arguments and global regions (WIP).
Added a test to check variables are unknown after a widened loop, which currently fails.

Hi Sean,

Ted provided more details off-list. He suspects that the problem is that we likely don't add MemSpaceRegions to the worklist because every region is a subregion of a MemSpaceRegion, and thus we would invalidate, by default, all regions that were in the same MemSpace as the regions we were invalidating. He thinks we want to not change that behavior, but also provide a way of invalidating an entire MemSpace if so desired. That's probably just a slight tweak to the algorithm.

I’ll take a look at your updated patch to reproduce what you are seeing and investigate to see if that is what is going on.

Thanks,
Devin

Hi Devin,

That sounds good, thanks for looking at this. Can you let me know when you will be able to look at this? I am doing some work that builds on this patch's functionality and having this patch accepted would help.

Thanks,
Sean

dcoughlin edited edge metadata.Sep 11 2015, 8:14 AM

I'll look at this today. Thanks for your patience!

I looked at the behavior of invalidateRegions() under the patch. It looks like MemSpaceRegions are being added to the worklist but these regions don't have clusters associated with them so invalidating the regions themselves doesn't remove any bindings.

Ted: What would you think about extending the logic in GenerateClusters() that invalidates globals for function calls to handle this more general case? That logic already checks to see if a region is in a particular global MemSpaceRegion and -- if so -- adds it to the worklist.

We could, for example, add a new RegionAndSymbolInvalidationTraits (e.g., TK_InvalidateEntireMemSpace). Clients of invalidateRegions could set this trait on the MemSpace they want to invalidate and GenerateClusters() could add any region with a memory space that has this trait to the work list. Does this seem like a reasonable approach?

There is a patch to add memspace region invalidation in http://reviews.llvm.org/D12993.

seaneveson updated this revision to Diff 35216 (Edited). Sep 21 2015, 3:02 AM
seaneveson edited edge metadata.

The TK_EntireMemSpace trait is now used when invalidating. The trait was added in http://reviews.llvm.org/D12993, thank you Devin for the patch.
Updated to the latest trunk.
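A plausible sketch of how the widening code might apply the new trait, reusing the Regions array from the snippet earlier in this thread (the exact call shape is an assumption):

RegionAndSymbolInvalidationTraits ITraits;
for (const MemRegion *Region : Regions)
  ITraits.setTrait(Region,
                   RegionAndSymbolInvalidationTraits::TK_EntireMemSpace);
// Passing &ITraits to invalidateRegions then invalidates the entire
// contents of each listed memory space, not just the regions themselves.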

nit: Please use proper punctuation in the comments.

lib/StaticAnalyzer/Core/ExprEngine.cpp
1616 ↗(On Diff #35216)

Do we lose precision for loops that need to be executed exactly maxBlockVisitOnPath times?

lib/StaticAnalyzer/Core/LoopWidening.cpp
10 ↗(On Diff #35216)

This would allow widening not just constant bound loops!
I think we should widen any loop that we know we did not fully execute.

171 ↗(On Diff #35216)

Typo: should be "iteration".

What do you mean by "approximate"? Note that people rely on the analyzer to track the exact values of variables. They expect it to know what the value is. It can be very confusing if the analyzer states that it knows the value of the variable and it is the wrong value.

I am concerned with changing the value of a single variable of the loop based on a syntactic check. Also, I am not sure there is a strong need for preserving the value of the index variable either.

test/Analysis/constant-bound-loops.c
1 ↗(On Diff #35216)

Should we test that the loop is still executed n times before it is widened?

174 ↗(On Diff #35216)

I think we should extend this test case.
Ex: what about the heap? What about variables touched by the loop, or variables declared in the loop's scope?

dcoughlin added inline comments.Sep 21 2015, 10:38 PM
lib/StaticAnalyzer/Core/LoopWidening.cpp
149 ↗(On Diff #35216)

This doesn't seem quite right. Consider:

int i;
for (i = 0; i < 21; i += 3) {}
clang_analyzer_eval(i == 23);

The value of i should be 21 after the loop, but this code sets it to 23. And what happens if i starts at 1 instead of 0?

Another (ridiculous) case to consider:

for (i = 0; i < 21; i += 3) {
  if (random() % 2 == 1) {
     i = i * i;
  } else {
    i--;
  }
}

What are the possible values of i after the loop?

Thank you for your comments.

@zaks.anna wrote:

What do you mean by "approximate"?

I think @dcoughlin gave a perfect example in the following comment:

@dcoughlin wrote:

This doesn't seem quite right. Consider:

int i;
for (i = 0; i < 21; i += 3) {}
clang_analyzer_eval(i == 23);

The value of i should be 21 after the loop, but this code sets it to 23. And what happens if i starts at 1 instead of 0?

The patch only examines the loop condition to avoid having to consider any of the loop body. It approximates an exit value for the variable in the condition by evaluating an iteration with the variable at its maximum/minimum value. This is loosely based on the widening and narrowing described in this paper: http://dl.acm.org/citation.cfm?id=512973

@zaks.anna wrote:

Note that people rely on the analyzer to track the exact values of variables. They expect it to know what the value is. It can be very confusing if the analyzer states that it knows the value of the variable and it is the wrong value.

I am concerned with changing the value of a single variable of the loop based on a syntactic check. Also, I am not sure there is a strong need for preserving the value of the index variable either.

Since the goal of this patch is to get the analyzer to move past loops, I see no problem with invalidating the index variable. That said, I still think it is worth doing the approximation, as it identifies infinite loops.

Further work could then be done to set the value of the index variable after 'simple' loops, where the analyzer can be sure of the value. In addition to this, the detection of infinite loops could be improved to work for any loop. This would allow the widening to be applied to any non-infinite, non-exiting loop.

lib/StaticAnalyzer/Core/ExprEngine.cpp
1616 ↗(On Diff #35216)

Yes.

To be precise, the loss is for loops that need to be executed (maxBlockVisitOnPath - 1) times, because processCFGBlockEntrance generates a sink if blockCount >= maxBlockVisitOnPath.

With the default value for maxBlockVisitOnPath, a loop that iterated three times would be fully analyzed with widen=false, but would be unnecessarily widened if widen=true.

This could be fixed by (effectively) incrementing the value of maxBlockVisitOnPath, when the widening was enabled.

e.g. replacing the following line from processCFGBlockEntrance:

if (nodeBuilder.getContext().blockCount() >= AMgr.options.maxBlockVisitOnPath) {
...

With something like:

int blockLimit = AMgr.options.maxBlockVisitOnPath + 
                 (AMgr.options.shouldWidenConstantBoundLoops() ? 1 : 0);
if (nodeBuilder.getContext().blockCount() >= blockLimit) {
...

Does that seem reasonable?

test/Analysis/constant-bound-loops.c
174 ↗(On Diff #35216)

Good point. I actually encountered a false positive while improving this case.

int *h_ptr = malloc(sizeof(int));
*h_ptr = 3;
for (int i = 0; i < 10; ++i) {} // WARNING: Potential leak of memory pointed to by 'h_ptr'

The value of h_ptr is invalidated, but I'm not sure about *h_ptr. Is it likely this warning is produced because the pointer is invalidated, but not the memory?

zaks.anna added inline comments.Sep 23 2015, 10:14 AM
lib/StaticAnalyzer/Core/ExprEngine.cpp
1616 ↗(On Diff #35216)

This would increment the value for all loops...

I think we should first answer the question below. Do we want to widen all loops, not just loops with "constant bounds"?

test/Analysis/constant-bound-loops.c
174 ↗(On Diff #35216)

Could you provide the full example? There is a leak in the code segment above.

Sean,

One important difference between what you are proposing and what Cousot and Cousot describe in the paper you cite is that they don't drop coverage. They widen for termination on infinite-height lattices, and their narrowing still maintains an over-approximation (it covers all potential paths through a loop -- and perhaps some extra ones as well). What the analyzer had before with constant-bound loops was an under-approximation: it would just stop exploring paths after a while.

What you are proposing is neither an under-approximation nor an over-approximation because it will both drop coverage (e.g., the path in my simple example above where i is 21 after the loop) and add spurious paths (e.g., the path where i is 23). This is something that it is best to avoid when possible because it will cause the analyzer to both miss true positives and lead to false positives.
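Concretely, for the earlier example:

int i;
for (i = 0; i < 21; i += 3) {}
/* Sound result: the only feasible value after the loop is i == 21.
   Patched result: the i == 21 path is dropped (coverage loss) and a
   spurious i == 23 path is added, so it is neither an over- nor an
   under-approximation. */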

My initial approach was for the analyzer to have as much information as possible after the loop. This means there are cases where the information is incorrect. Future work would be to reduce these cases.

I believe your preferred approach is to have no inaccuracies after the loop, which can initially be achieved by providing less information. Further work would add more (guaranteed accurate) information. In this way the analyzer is naturally conservative when it isn't sure of something. I now agree that this is a better approach to take.

What do people think about me creating a new patch based on your feedback? The aim would be to widen any non-exiting loops by invalidation. The initial patch would be behind a flag and would use the TK_EntireMemSpace trait to conservatively invalidate 'everything' when a loop does not exit. It would then run through the loop body in the invalidated state, resulting in the analysis of all possible paths. The loop would then exit from the (now possibly false) loop condition, or a (now reachable) break statement. Loops that never exit regardless of any variables would be an exception; see my comment below for thoughts on handling infinite loops.
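For instance, a break that only becomes reachable once the state is invalidated ('bound' and 'limit' are illustrative stand-ins):

int n = 0;
for (int i = 0; i < bound; ++i) {
  n += i;           /* after widening, 'n' and 'i' are unknown */
  if (n > limit)    /* both branches are feasible in the widened state... */
    break;          /* ...so the path through this break is analyzed */
}
/* analysis continues here via the widened condition or via the break */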

Future improvements would prevent unnecessary invalidation and calculate the values of modified variables (after the loop).

lib/StaticAnalyzer/Core/LoopWidening.cpp
10 ↗(On Diff #35216)

I agree that the goal should be to widen any loop that isn't fully executed, but we need to consider infinite loops, which might be falsely exited after widening. The way I see it, assuming the widening will be done by invalidating variables/memory, we could:

  1. Only widen loops which definitely exit.
  2. Widen all loops unless they are definitely infinite (or widen them in a way which minimizes the number of infinite loops which then exit).
  3. Come up with a method which decides whether a loop is infinite (but tolerates mistakes either way) and widen when we "think" the loop will exit. This is similar to the current approach for constant bound loops.

My current preference would be option 2. This is based on the assumption that loops are generally written to exit at some point, and if they aren't, they are unlikely to have code after them anyway. When it does not know which branch the program will go down, the analyzer's approach is to go down both. Similarly if the analyzer does not know whether the loop exits, it should optimistically go down the exit branch (IMO).

test/Analysis/constant-bound-loops.c
174 ↗(On Diff #35216)

Sorry, I meant to say the warning is there regardless of what comes after the loop. Here is a full test case:

typedef __typeof(sizeof(int)) size_t;
void *malloc(size_t);
void free(void *);

void test() {
    int *h_ptr = (int *) malloc(sizeof(int));
    for (int i = 0; i < 2; ++i) {} // No warning.
    for (int i = 0; i < 10; ++i) {}
    free(h_ptr);
}

It produces the following for me:

clang.exe -cc1 -analyze -analyzer-checker=core,unix.Malloc -analyzer-config widen-constant-bound-loops=true test.cpp
test.cpp:8:31: warning: Potential leak of memory pointed to by 'h_ptr'
    for (int i = 0; i < 10; ++i) {}
                              ^
1 warning generated.

What do people think about me creating a new patch based on your feedback?

I don't think you need to create a new review -- we can update this one and keep the history.

The aim would be to widen any non-exiting loops by invalidation. The initial patch would be behind a flag and would use the TK_EntireMemSpace trait to conservatively invalidate 'everything' when a loop does not exit. It would then run through the loop body in the invalidated state, resulting in the analysis of all possible paths. The loop would then exit from the (now possibly false) loop condition, or a (now reachable) break statement.

This sounds great to me. I think it shouldn't be too difficult to add coarse widening before the last iteration through the loop for both concrete-bounded loops and loops with symbolic bounds. This will be a very imprecise hammer. It will invalidate everything (heap, locals, globals). I think you can choose the widening location (i.e., a loop back edge) to make it possible to still learn from the guard condition. For example, for the following loop I think the analyzer should be able to widen and still know that ‘j’ < ’N’ on the final, widened iteration through the loop and that ‘j’ >= ’N’ after the loop (assuming there is no ‘break’):

for (int j = 0; j < N; j++) { .. }

This will let us evaluate what kinds of improvements in coverage you get from not generating a sink but will almost certainly be too imprecise to turn on by default. I think a slightly tricky challenge here will be to handle nested loops without unnecessary sinks.

You could then make the widening more precise by invalidating only memory regions that a pre-analysis indicates may be modified by the loop. This pre-analysis can be relatively simple — invalidating the cone of influence of the DeclRefExprs in the loop (as Ted suggested above). With this implemented, we can evaluate precision and performance to decide whether to turn this on by default — but the hope is yes. If not, we may need to play with unsound tricks that leave less invalidated.

Loops that never exit regardless of any variables would be an exception; see my comment below for thoughts on handling infinite loops.

Generalized non-termination checking is a very hard problem and is not something the analyzer is particularly well-suited for. While this would be an interesting avenue for the analyzer to pursue, I think it should be in a separate checker.

seaneveson updated this revision to Diff 36739.Oct 7 2015, 7:36 AM

Updated to latest revision.
Changed patch to widen all loops which do not exit.
Changed to widen on block entrance so the guard condition is accounted for (thanks for the suggestion).
Updated tests to reflect changes.

seaneveson marked 4 inline comments as done.Oct 7 2015, 7:56 AM

There are some issues which haven't been resolved yet:

  • There is a loss of precision for loops that need to be executed exactly maxBlockVisitOnPath times, as the loop body is executed with the widened state instead of the last iteration.
  • The invalidation still causes memory leak false positives (see failing test: pointer_doesnt_leak_from_loop).
seaneveson retitled this revision from [Analyzer] Handling constant bound loops to [Analyzer] Widening loops which do not exit.Oct 8 2015, 12:49 AM

There is a loss of precision for loops that need to be executed exactly maxBlockVisitOnPath times, as the loop body is executed with the widened state instead of the last iteration.

I think this is an acceptable loss of precision because, in general, it is unlikely that a concrete-bounded loop will be executed *exactly* maxBlockVisitOnPath times. You could add special syntactic checks to try to detect this, but I think it is unlikely to make much difference in practice.

lib/StaticAnalyzer/Core/LoopWidening.cpp
64 ↗(On Diff #36739)

Passing true here instead of false for CausedByPointerEscape should fix the false alarm. This will cause invalidation to tell the MallocChecker that the pointer has escaped and so shouldn't be treated as a leak.
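Relative to the earlier snippet, the suggested change is just the final argument (a sketch):

State = PrevState->invalidateRegions(Regions, Cond, BlockCount, LCtx,
                                     /*CausedByPointerEscape*/ true);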

test/Analysis/loop-widening.c
3 ↗(On Diff #36739)

There seems to be some weirdness when using _Bool in the prototype (vs. int). It adds an extra != 0 to symbolic expressions, which can result in clang_analyzer_eval() yielding UNKNOWN where using extern void clang_analyzer_eval(int) would print TRUE or FALSE.

We should track this down and fix it, but for now I think it is better to use extern void clang_analyzer_eval(int).
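That is, for now the tests would declare the helper as:

extern void clang_analyzer_eval(int);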

13 ↗(On Diff #36739)

This is clever.

111 ↗(On Diff #36739)

After changing the clang_analyzer_eval() prototype to take an int, you can use clang_analyzer_eval(i >= 50); // expected-warning {{TRUE}}.

158 ↗(On Diff #36739)

I think it would be good to add some nested loop tests. For example:

void nested1() {
  int i = 0, j = 0;
  for (i = 0; i < 10; i++) {
    clang_analyzer_eval(i < 10); // expected-warning {{TRUE}}
    for (j = 0; j < 2; j++) {
      clang_analyzer_eval(j < 2); // expected-warning {{TRUE}}
    }
    clang_analyzer_eval(j >= 2); // expected-warning {{TRUE}}
  }
  clang_analyzer_eval(i >= 10); // expected-warning {{TRUE}}
}

void nested2() {
  int i = 0, j = 0;
  for (i = 0; i < 2; i++) {
    clang_analyzer_eval(i < 2); // expected-warning {{TRUE}}
    for (j = 0; j < 10; j++) {
      clang_analyzer_eval(j < 10); // expected-warning {{TRUE}}
    }
    clang_analyzer_eval(j >= 10); // expected-warning {{TRUE}}
  }
  clang_analyzer_eval(i >= 2); // expected-warning {{TRUE}}
}
seaneveson updated this revision to Diff 37462.Oct 15 2015, 2:26 AM

Set CausedByPointerEscape to true
Check if the loop has already been exited before widening
Changed tests to use void clang_analyzer_eval(int)
Added variable bound loop tests
Added nested loop tests

seaneveson marked 3 inline comments as done.Oct 15 2015, 2:33 AM

I think this is an acceptable loss of precision because, in general, it is unlikely that a concrete-bounded loop will be executed *exactly* maxBlockVisitOnPath times. You could add special syntactic checks to try to detect this, but I think it is unlikely to make much difference in practice.

I agree.

The problem could be solved by executing the widened state after the last iteration, rather than instead of it. That really requires changing the max-block-visit approach, which will be needed in the future anyway to support nested loops; this precision issue could be resolved at the same time.

include/clang/StaticAnalyzer/Core/PathSensitive/CoreEngine.h
84 ↗(On Diff #37462)

The BlockCount is specific to the path of analysis, so it can't be used to check whether a loop has been exited before. I've added this set to store and identify previously visited blocks. Is there a better way this could be done?

lib/StaticAnalyzer/Core/LoopWidening.cpp
65 ↗(On Diff #37462)

Perfect, thanks.

Perhaps it could be changed back to false when the invalidation is better? Then it could catch true positives, e.g.:

int *a = new int[10];
for (int i = 0; i < 100; ++i) {
  if (i > 90) 
    ++a;
  ...
}
test/Analysis/loop-widening.c
159 ↗(On Diff #37462)

Thank you for the example tests.

When an inner loop is widened, any outer loops are widened 'coincidentally' since everything is invalidated. This means the second example test passes. When an inner loop is not widened the outer loop will not be widened. This is because a sink is generated when the max block count is reached the second time through the outer loop. i.e.

for (i = 0; i < 10; i++) {
  for (j = 0; j < 2; j++) {
    // A sink is generated here on the second iteration of the outer loop,
    // since this block was already visited twice (when i == 0).
    ...
  }
}
// Never reach here

This means the first example test does not pass. I’ve added the tests with a FIXME comment.

Hi Sean,

Sorry it took me so long to get back to you.

include/clang/StaticAnalyzer/Core/AnalyzerOptions.h
266 ↗(On Diff #37462)

Is this field used?

lib/StaticAnalyzer/Core/ExprEngine.cpp
1407 ↗(On Diff #37462)

What is the purpose of checking whether the block has already been visited by some other path?

If I understand correctly, this will stop the analyzer from widening before the "last" iteration through the loop and so will result in a sink after that iteration. What this means is that we will never explore the code beyond the loop in the widened state -- but isn't this the whole point of the widening?

So, for example, in your variable_bound_exiting_loops_not_widened() test, don't we want the clang_analyzer_eval() statement after the loop to be symbolically executed on 4 separate paths? That is, once when i is 0, once when i is 1, once when i is 2 and once when i is $conj_i$ + 1 where $conj_i$ is the value conjured for i when widening.

Also, in general, analysis of one path should not depend at all on whether another path has been explored. This is because we want the analyzer to be free choose different strategies about path exploration (e.g., BFS vs. DFS, prioritizing some paths over others, etc.) without changing the issues reported on along any given path. For this reason, I don't think we necessarily want to track and expose blockWasVisited() on CoreEngine or use this to determine when to permit a sink.

lib/StaticAnalyzer/Core/LoopWidening.cpp
98 ↗(On Diff #37462)

I get a warning here about comparing a signed int (RegionIndex) to an unsigned (NumRegions).

I think you can avoid this and simplify things by using a range-based for loop:

const MemRegion *Regions[] = {
    ...
};
RegionAndSymbolInvalidationTraits ITraits;
for (auto *Region : Regions) {
  ...
}
seaneveson marked an inline comment as done.Oct 26 2015, 9:36 AM

Hi Devin,

Sorry it also took me so long to get back to you.

include/clang/StaticAnalyzer/Core/AnalyzerOptions.h
266 ↗(On Diff #37462)

No. Thanks I'll fix that.

lib/StaticAnalyzer/Core/ExprEngine.cpp
1407 ↗(On Diff #37462)

I was trying to avoid doing unnecessary invalidation, where the variable bound loop had already exited. I suppose this won’t be a concern when the invalidation is improved. If everyone is happy for loops that have already exited to be widened then I will remove the related changes.

lib/StaticAnalyzer/Core/LoopWidening.cpp
98 ↗(On Diff #37462)

Will do. Thanks.

seaneveson updated this revision to Diff 38519.Oct 27 2015, 3:44 AM

Removed checking for already exited loops
Addressed Comments

seaneveson marked 5 inline comments as done.Oct 27 2015, 3:57 AM
seaneveson added inline comments.
test/Analysis/loop-widening.c
160 ↗(On Diff #38519)

The inner loop will now be widened in the first example test, which 'coincidentally' widens the outer loop as well. This means this test now passes.

dcoughlin accepted this revision.Oct 28 2015, 4:25 PM
dcoughlin edited edge metadata.

LGTM. Please commit. Thanks for tackling this, Sean!

seaneveson updated this revision to Diff 38717.Oct 29 2015, 3:00 AM
seaneveson edited edge metadata.

Updated to latest revision

This revision was automatically updated to reflect the committed changes.