This is an archive of the discontinued LLVM Phabricator instance.

[analyzer] Check ArraySubscriptExprs in ArrayBoundCheckerV2
AbandonedPublic

Authored by donat.nagy on May 12 2023, 6:44 AM.

Details

Summary

The checker alpha.security.ArrayBoundV2 was implemented with a check::Location callback, so it triggered on memory read or write operations (and checked memory operations that handled ElementRegions).

The generality of this solution was elegant, but I'm replacing it with an implementation based on check::PreStmt<ArraySubscriptExpr>, because that'll be a better base for implementing intuitive and detailed error reports.

The logic for simplifying symbolically handled inequalities is unchanged and the existing unit tests pass under then new implementation without changes. Moreover, I'm highlighting the imprevements with four unit tests that all fail before this commit and pass with this commit:

(1) The case test_indirect_use() shows that the new code places the bug report at the location where an array is indexed incorrectly, while the old implementation placed the bug report in use_ptr() where the bad ElementRegion was accessed. I think this new behavior is more intuitive and basically a prerequisite for producing high quality error reports.

(2) The case test_multidim_generic() shows that the old logic badly mishandled multidimensional arrays.

(3) The case test_field() reveals that the old logic didn't notice memory access that touched only a FieldRegion inside an erroneous ElementRegion.

(4) The case test_in_unknown_space() shows that the checker had ignored all underflows occuring in UnknownSpaceRegion (i.e. memory accessed via unknown pointers) even if they happened in a well-structured area.

Among these issues (1) needed the switch to check::PreStmt, (2) and (3) are more elegant with check::PreStmt, but could've been solved with check::Location, and (4) is an independent improvement.

Finally, this commit also adds the testcases test_multidim_zero() and test_undersized_region() which are currently disabled, but should be fixed and enabled by followup commits.

Diff Detail

Event Timeline

donat.nagy created this revision.May 12 2023, 6:44 AM
Herald added a project: Restricted Project. · View Herald Transcript
donat.nagy requested review of this revision.May 12 2023, 6:44 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 12 2023, 6:44 AM
Herald added a subscriber: cfe-commits. · View Herald Transcript
donat.nagy added a comment.EditedMay 12 2023, 6:56 AM

After some thinking and discussion with @gamesh411 I decided that it'd be better to replace the callback used by this checker. This is a clean but rough draft of this concept; for a final version I'd consider:

  • adding a secondary callback that handles *ptr equivalently to ptr[0];
  • including a special case that it's valid to form an after-the-end pointer as &arr[N] where N is the length of the array;
  • including a check::Location callback that creates bug reports when these after-the-end pointers are dereferenced.

@steakhal What do you think about this design direction?

(Re-uploaded patch to apply clang-format changes and remove a stray temporary comment.)

I still struggle to see the motivation.
Please consider adding a test demonstraing a case when the diagnostic is not relevant or easily misunderstood by the users. Alternatively, give an evaluation of this change so that I can see it myself. In the end it would still make sense to have something to talk about some specifi case(s).

donat.nagy planned changes to this revision.May 15 2023, 11:27 AM

I started to create tests to demonstrate the differences from the baseline, and it turns out that my patch is handling multidimensional arrays incorrectly. I'll debug this issue and upload a fixed version of this patch.

donat.nagy edited the summary of this revision. (Show Details)

I managed to debug the issue that plagued my commit (see test_multidim_zero() for details, I think I'll fix it in a followup change); and I have uploaded a new version of this patch with testcases that highlight the improvements introduced by it.

@steakhal If you need any additional clarification, feel free to ask me.

donat.nagy added a comment.EditedMay 16 2023, 5:22 AM

By the way, I'm fed up with the hack that ElementRegion is used for three separate things ("real" array indexing, casts and pointer arithmetic). To fix this I'm thinking about introducing a subclass hierarchy where a base class ElementLikeRegion has three subclasses:

  • ElementRegion represents the smaller memory area of one element in an array,
  • CastRegion represents the same memory area, but with a different type, and
  • OffsetRegion represents the same memory area, but with a different starting point.

Most old references to ElementRegion could be replaced by references to ElementLikeRegion, but functions like stripCasts() would be able to distinguish between the subclasses and do the intuitive thing.

What do you think about this idea? Do you see any problem with it?

(By the way I'm not satisfied with these quickly picked class names -- feel free to suggest better ones!)

By the way, I'm fed up with the hack that ElementRegion is used for three separate things ("real" array indexing, casts and pointer arithmetic). To fix this I'm thinking about introducing a subclass hierarchy where a base class ElementLikeRegion has three subclasses:

  • ElementRegion represents the smaller memory area of one element in an array,
  • CastRegion represents the same memory area, but with a different type, and
  • OffsetRegion represents the same memory area, but with a different starting point.

Most old references to ElementRegion could be replaced by references to ElementLikeRegion, but functions like stripCasts() would be able to distinguish between the subclasses and do the intuitive thing.

What do you think about this idea? Do you see any problem with it?

Ah, yes. I agree that this is a real issue. However, it's going to be non-trivial to lift all uses. ATM I cannot see immediate blockers.
I can only encourage you to explore this direction and report back in an RFC on discourse.
That change should be an NFC change, and only afterward apply semantic improvements - that are likely uncovered by this proposal.

(By the way I'm not satisfied with these quickly picked class names -- feel free to suggest better ones!)

Yes, there are probably better alternatives. Let's push this aside for now.

I'll move this class hierarchy modification question into a discourse thread after collecting the relevant facts. (By the way, thanks for suggesting discourse, I was not aware of it!)

@steakhal What is your first impression about this commit? Is this the (or at least a) good direction? May I continue with creating additional improvements (e.g. better error messages) that depend on this commit?

I'll come back one or two weeks later.

NoQ added a comment.May 25 2023, 6:31 PM

By the way, I'm fed up with the hack that ElementRegion is used for three separate things ("real" array indexing, casts and pointer arithmetic). To fix this I'm thinking about introducing a subclass hierarchy where a base class ElementLikeRegion has three subclasses:

  • ElementRegion represents the smaller memory area of one element in an array,
  • CastRegion represents the same memory area, but with a different type, and
  • OffsetRegion represents the same memory area, but with a different starting point.

Most old references to ElementRegion could be replaced by references to ElementLikeRegion, but functions like stripCasts() would be able to distinguish between the subclasses and do the intuitive thing.

What do you think about this idea? Do you see any problem with it?

(By the way I'm not satisfied with these quickly picked class names -- feel free to suggest better ones!)

My usual response to this is, "region types were a mistake". Every single time when the program relies on region type, it is actually supposed to rely on either AST type behind the program point, or on the dynamic type map if runtime types are of interest. I believe the ideal solution is to remove all non-base regions (ElementRegion, FieldRegion, CXXBaseObjectRegion, CXXDerivedObjectRegion) entirely, and instead represent all single-location values as (base region, offset) pairs (where offset can be an arbitrary SVal of appropriate numeric type), and represent regions as (base region, begin offset, end offset) triples.

So for example the SymbolRegionValue{FieldRegion{VarRegion}} object reg_$0<int var->field> would be turned into a SymbolRegionValue{VarRegion, Offset} object reg_$0<int var @+32> where 32 is the bit offset of field field in the struct. We could preserve the end offset here as well, but the int part already tells us that it's 64 or something like that. Funny type punning should probably be handled by the surrounding expression instead; here we just need to capture what the memory read looked like when it produced that symbol.

Basically do whatever RegionStore already does right, everywhere else.

I have more thoughts/examples at https://github.com/llvm/llvm-project/issues/42709

I don't know whether my solution is easier to implement than your solution. My solution yields a much simpler system at the end (whereas your solution increases complexity), but in my case the changes can be pretty dramatic. Both solutions will be pretty hard because you'll have to untangle a lot of code built specifically to deal with the existing artificial ambiguities. One good thing about my solution is, you'd at least be able to consult runtime behavior as the ultimate source of truth; region types aren't part of the runtime behavior so there's no right or wrong way to handle them so they'll always be a source of confusion. But if I was to guess, I'd say your solution is probably more realistic to implement on a short time frame, and you might be able to make your work somewhat incremental as well, by introducing the new regions in more and more cases.

I'm replacing it with an implementation based on check::PreStmt<ArraySubscriptExpr>

I think this is a great call; there's no fundamental reason why this would be more precise, but this tightly ties the problem to source code (as opposed to low-level memory layout), which allows us to shape the entire analysis differently and communicate with the user more efficiently.

One straightforward implication is that this way we avoid dealing with multiplication/division when calculating offsets. This is particularly valuable when the offset is symbolic, so existing decomposition into RegionOffset basically gives up. Of course we could fix RegionOffset to correctly compute a symbolic offset value instead, with all the multiplications and divisions over symbolic expressions. Then the constraint solver will need to solve these multiplications/divisions. Of course we could improve the constraint solver to handle them better, but if we can avoid this entirely, why not? Then, after the solver gives us the answer, we have to explain the answer to the user. And then, again, this isn't unsolvable, but that explanation would be much more convoluted in terms of symbolic byte offsets, than in terms of the actual source-level array index.

Now, symbolic offsets are actually extremely important because I believe this is pretty much the only case where we can reach good false positive rate. A typical loop over an array will be modeled by assigning the loop counter to concrete values 0, 1, 2, 3 and simulating the loop. This is the root cause of many false positives because the assumption that the loop can definitely be executed exactly 0 (or exactly 1 or exactly 2 or exactly 3) times is deeply incorrect. However when the index is symbolic, it indicates that the index is not a loop counter, so these false positives aren't going to hurt us.

Finally, when it comes to reclaiming our chance of finding bugs in loops, I think you should try to take your patch to the next level: don't even warn at check::PreStmt<ArraySubscriptExpr>, warn at checkPreStmt<ForStmt>!! Where check::PreStmt<ForStmt> is not actually a thing we support yet, but you get the point, simulate it with check::BranchCondition or something. The point is to analyze the loop ahead of time, syntactically or in a flow-sensitive manner (just like the loop unrolling facility figures out if a loop is worth unrolling, or how clang-tidy's bugprone-infinite-loop finds infinite loops), identify access patterns to the array in the loop, reach a conclusion that the array will be accessed by the loop up to a certain bound, compute symbolic value of that bound, and verify that it's within array bounds immediately, before entering the loop! In this case it doesn't matter how imprecise our loop modeling is, you'll still be able to reduce the concrete loop unrolling problem into a symbolic constraint problem, which we have a much better chance of solving correctly beyond toy examples.

@NoQ Re: ElementRegion hacks: Your suggestion is very convincing, and I agree that while my idea would bring some clarity to some particular issues, overall it would just worsen the "chaotic heap of classes" situation. The only advantage of my solution is that it would be an incremental change; but it's only incrementing the amount of problems, so I think I won't continue working on it.

On the other hand, I'd be interested in participating in the implementation of your suggestion. Unfortunately I don't have enough capacity to just sit down and start coding it; but I could suggest this reorganization as a potential thesis topic for a student/intern. (Our team regularly gets interns from Eötvös Loránd University; if one of them is interested, then perhaps we could tackle this issue with me in a mentor / advisor role, and the student creating the bulk of the code reorganization commits. These are very far-fetched plans, but I'd be glad to see the cleanup that you suggested and I'll try to contribute what I can. Of course if you have a more concrete plan for doing this rewrite, then instead of intruding I'm happy to help with e.g. reviews.)

Re: this review:

One straightforward implication is that this way [checking PreStmt<ArraySubscriptExpr>] we avoid dealing with multiplication/division when calculating offsets.

Unfortunately this is not the case, because the size of the arrays (calculated by clang::ento::getDynamicExtent() in DynamicExtent.cpp) is still expressed in bytes ("CharUnits") and we need to multiply the index with sizeof(elemType) before we can compare it to this value. The distinguishing feature of ArrayBoundCheckerV2 is that it has some (ad-hoc but useful) logic for reasoning about some multiplications; that part is not changed by this commit.

Now that you mention it, I see that it would be nice to avoid bothering with byte offsets and multiplications in the simple case when we want to take a single int from an array of ints. To achieve this we'd need to introduce extent handling functions that measure the size of arrays in number of elements (of what kind?) instead of bytes; right now I don't know how difficult would it be to achieve that. (As far as I see the extent is usually either a constant [that can be readily divided], or it's a freshly conjured symbol... Are those symbols used for anything meaningful? Can we ever put an upper bound on them?) However this all should probably belong to a separate commit/review/discourse thread.

Re: PreStmt<ForStmt>: That's a very good idea, but I feel that it belongs to the "engine level" and it's mostly orthogonal to the behavior this checker. My short-term (or at most medium-term) goal is to move this checker out of alpha (fix bugs, squash common false positives, improve messages etc.) and here I accept the limitation that (like all the currently "stable" checkers) it's mostly limited to loop-less code. On a longer term it'd be interesting to work on improving the loop handling engine (or just adding a gimmicky "loop prediction" ability to this checker); but I don't have concrete plans that far ahead.

steakhal requested changes to this revision.Jun 7 2023, 6:40 AM

After some thinking and discussion with @gamesh411 I decided that it'd be better to replace the callback used by this checker. This is a clean but rough draft of this concept; for a final version I'd consider:

  • adding a secondary callback that handles *ptr equivalently to ptr[0];
  • including a special case that it's valid to form an after-the-end pointer as &arr[N] where N is the length of the array;
  • including a check::Location callback that creates bug reports when these after-the-end pointers are dereferenced.

@steakhal What do you think about this design direction?

Could you elaborate on this part?
To clarify my position, I think we can go with either approach. We can stay with the check::Location, or with the PreStmt route.
To me, the only important aspect is to not regress, or prove that the places where we regress are for a good reason, and in the end we provide more valuable reports to the user.
For example, we can sacrifice some TPs for dropping a large number of FPs. That seems to be a good deal, even if we "regress" on some aspect.


Anyway, I just checked the impact of this patch and it was so big that I rerun the measurement to be sure (thus the delay :D), but it didn't get any better.
There are a lot more reports if I apply this patch, which suggests to me that there is something wrong.

For instance, in this example, we currently don't report any warnings, but with the patch, we would have some. https://godbolt.org/z/sjcrfP8df
We also lose some reports like this: https://godbolt.org/z/8113nrYhe

Please investigate it and also do comparative checks on real codebases to make sure the change is aligned with the expectations.
Given that some of our users depend on the current behavior, we should ensure that no reports disappear or appear unless they are justified.

This revision now requires changes to proceed.Jun 7 2023, 6:40 AM
donat.nagy added a comment.EditedJun 7 2023, 11:43 AM

Thanks for the review and the testing!

[...] What do you think about this design direction?

Could you elaborate on this part?

In short, this review should've been a discourse thread about my "switch to the PreStmt route" plan: I wanted to ask for architectural / high-level opinions before working on the details. If you or other reviewers don't reject this plan outright, I'll flesh out the secondary features (e.g. recognizing that *(arr+5) is equivalent to arr[5]).

To me, the only important aspect is to not regress, or prove that the places where we regress are for a good reason, and in the end we provide more valuable reports to the user.

My goal is to turn this checker into a useful and stable (i.e. non-alpha) checker as soon as possible.

On this path eliminating FPs (and improving the messages) is a high priority goal, but I would freely accept losing TPs as long as the remaining ones are still enough to be useful. (Theoretical example: if the current code reports a certain unusual TP (with a spartan message), but it'd be difficult to give a good error message in that situation, then I'd prefer sacrificing that TP on the altar of quick progress. Once the checker is out of alpha, we can re-add support for non-essential cases like that.)

For instance, in this example, we currently don't report any warnings, but with the patch, we would have some. https://godbolt.org/z/sjcrfP8df

I have a suspicion that these issues might be caused by the heuristics that I'm using to guess the "meaning" of the ElementRegion layers (I marked them with an inline comment). I'll examine and troubleshoot them, but probably only during the next week, because right now I'm in the middle of another task (upstreaming the ericsson-internal BitwiseShift checker) and I'd like to reach a milestone on it (e.g. starting the open source review).

We also lose some reports like this: https://godbolt.org/z/8113nrYhe

I know that this commit does not handle "array indexing" that's expressed with the pointer dereferencing operator. Re-adding this feature is a very straightforward task, and nothing depends on it, so I'll only invest work into it if I see that the other, less clear steps of my plans are successful.

clang/lib/StaticAnalyzer/Checkers/ArrayBoundCheckerV2.cpp
318–319

This is the heuristic that I'm talking about in the main comment block. Note that the new code only strips ElementRegion layers where the type is the same, while the old code stripped all ElementRegion layers (and thus produced crazy FNs on multidimensional arrays).

Okay. I think we are aligned. I'm still uncomfortable sacrificing reports for fast pace development. IMO if that is the goal, then a brand new alpha checker is the way forward.
You can be sure that no external users depend on it as its brand new. Unlike with this V2 checker.
If everything goes well, after a careful evaluation we could drop V1 and V2 in favor of V3.
However, if let's say in a year that goal is not reached, we should probably do some cleanup regardless, so we not make the situation worse by having 3 unfinished alpha oob checkers. That would be a shame.

WDYT about a new checker?

Hmm, I agree that a new checker might be better for the situation when we'd sacrifice a significant amount of TPs, but after reconsidering the situation I think that my concerns are mostly theoretical: if the current code reports an issue, then it's probably not too difficult to provide a concise and useful bug report for it. (And if the issue is an eldritch mess that cannot be described in a concise manner, then it's better to ignore it...)

I might still temporarily "sacrifice" TPs or other advantages during the initial steps of the code reorganization (to keep the commit size more manageable), but I'll pay attention to compensating these in later commits (which may be merged at the same moment as the earlier ones).

donat.nagy abandoned this revision.Aug 4 2023, 8:41 AM

I'm abandoning this commit because it amalgamates several unrelated changes and I think it'd be better to handle them separately:

  1. First, there is a very simple, independent improvement related to underflows and UnknownSpaces. I already created the separate commit D157104 for this (reviews are welcome ;) ).
  2. As the title of this commit says, I wanted to turn this checker into a Checker<check::PreStmt<ArraySubscriptExpr>> instead of a Checker<check::Location>. I'm still planning to do this transition in a separate commit because I feel that this will be needed to compose good warning messages and there are a few situations like the testcase test_field where the check::Location model produces counter-intuitive results. (When I implement this, I'll also ensure that *p and p->field are handled analogously to p[0], but perhaps these will be in a separate commit.)
  3. Finally there is the "simplify RegionRawOffsetV2::computeOffset and fix multidimensional array handling" change. This is the modification that was responsible for the multitude of false positives caused by this commit; and I don't have a quick fix for it (the engine abuses ElementRegion to record pointer arithmetic and I didn't find a clear way to distinguish it from real element access). On the short-term I think I'll need to accept that this checker will produce false negatives in situations when one element of a multi-dimensional array is over-indexed without overflowing the whole array (e.g. if arr is declared as int arr[5][5], then arr[1][10] over-indexes arr[1], but points inside the full area of the matrix); but fortunately this is not a fatal limitation (it only produces false negatives, multi-dimensional arrays are not too common).

I'm abandoning this commit because it amalgamates several unrelated changes and I think it'd be better to handle them separately:

  1. First, there is a very simple, independent improvement related to underflows and UnknownSpaces. I already created the separate commit D157104 for this (reviews are welcome ;) ).
  2. As the title of this commit says, I wanted to turn this checker into a Checker<check::PreStmt<ArraySubscriptExpr>> instead of a Checker<check::Location>. I'm still planning to do this transition in a separate commit because I feel that this will be needed to compose good warning messages and there are a few situations like the testcase test_field where the check::Location model produces counter-intuitive results. (When I implement this, I'll also ensure that *p and p->field are handled analogously to p[0], but perhaps these will be in a separate commit.)
  3. Finally there is the "simplify RegionRawOffsetV2::computeOffset and fix multidimensional array handling" change. This is the modification that was responsible for the multitude of false positives caused by this commit; and I don't have a quick fix for it (the engine abuses ElementRegion to record pointer arithmetic and I didn't find a clear way to distinguish it from real element access). On the short-term I think I'll need to accept that this checker will produce false negatives in situations when one element of a multi-dimensional array is over-indexed without overflowing the whole array (e.g. if arr is declared as int arr[5][5], then arr[1][10] over-indexes arr[1], but points inside the full area of the matrix); but fortunately this is not a fatal limitation (it only produces false negatives, multi-dimensional arrays are not too common).

I'd like to thank you your effort on this subject.