This is an archive of the discontinued LLVM Phabricator instance.

[analyzer][NFC] Split CStringChecker to modeling and reporting
Needs ReviewPublic

Authored by steakhal on Jul 22 2020, 6:02 AM.

Details

Summary

This patch is an NFC. Mostly moving code segments here and there. Also a few renaming and minor refactorings.

Summary:

Introduces an API for interacting with the modeling part of the CStringChecker.
Using this API other checkers could query and potentially remove/invalidate information about the cstring length of an associated memory region.

This patch significantly reduces the complexity of the CStringChecker.
Introducing a modeling layer the hierarchy will look like this:

CStringModeling (infers cstring length from string literals, invalidates, updates, etc.)
\__ CStringChecker (checker with several filter options)
     \__ NullArg (filter option registered as a distinct checker)
     \__ BadSizeArg (same...)
     \__ OutOfBounds (same...)
     \__ many more...

It is questionable if we want to keep such a hierarchy or not.
Either way, that should be done in a different patch - I suppose.

Diff Detail

Event Timeline

steakhal created this revision.Jul 22 2020, 6:02 AM
steakhal set the repository for this revision to rG LLVM Github Monorepo.Jul 22 2020, 6:04 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 22 2020, 6:04 AM
Herald added a subscriber: cfe-commits. · View Herald Transcript
NoQ added a comment.Jul 22 2020, 7:30 PM

Shared accessors look amazing.

If i understood correctly, you're splitting up the part which performs boring bookkeeping for the state trait from the part which models strlen() and other various functions. Such separation also looks amazing because ultimately the first part can be made checker-inspecific (i.e., a reusable half-baked trait that can be instantiated multiple times to track various things regardless of their meaning).

I don't think i understand having unix.cstring.CStringChecker as one more entry in Checkers.td. Do you expect there to be a situation when enabling CStringModeling without CStringChecker actually makes sense? If not, why not keep them agglutinated? That doesn't anyhow contradict the above purpose of having boring bookkeeping separate from actual API modeling.

In D84316#2168462, @NoQ wrote:

Shared accessors look amazing.

[...] you're splitting up the part which performs boring bookkeeping for the state trait from the part which models strlen() and other various functions.

Exactly.

Such separation also looks amazing because ultimately the first part can be made checker-inspecific (i.e., a reusable half-baked trait that can be instantiated multiple times to track various things regardless of their meaning).

Could you elaborate on the latter part? (instantiated multiple times...)

I don't think i understand having unix.cstring.CStringChecker as one more entry in Checkers.td. Do you expect there to be a situation when enabling CStringModeling without CStringChecker actually makes sense?

You seem to be right.
Enabling only the cstring modeling part does not make much sense without enabling at least the CStringChecker to model the cstring manipulation functions - even if the reporting is disabled of such functions.

If not, why not keep them agglutinated?

I wanted to have a separate class for bookkeeping while minimalizing the necessary changes.
What do you think would be the best way to organize this separation?

Few notes:

  • Checkers are implemented in the anonymous namespace, so only the given file has access to them.
  • I wanted to separate the bookkeeping logic from the reporting/function modeling logic in different files.
  • I like the fact that after the change the CStringChecker implements only the evalCall checker callback.

Let me know if I misunderstood something.

Szelethus marked an inline comment as done.Jul 23 2020, 6:31 AM

Yay! This checker has become a major headache as the analyzer grew. Not on a RetainCount scale, but on a MallocChecker one. Cheap shots, I know :)

The patch looks great! I have some miscellaneous comments, but nothing blocking.

I don't think i understand having unix.cstring.CStringChecker as one more entry in Checkers.td. Do you expect there to be a situation when enabling CStringModeling without CStringChecker actually makes sense?

You seem to be right.
Enabling only the cstring modeling part does not make much sense without enabling at least the CStringChecker to model the cstring manipulation functions - even if the reporting is disabled of such functions.

If not, why not keep them agglutinated?

I wanted to have a separate class for bookkeeping while minimalizing the necessary changes.
What do you think would be the best way to organize this separation?

Few notes:

  • Checkers are implemented in the anonymous namespace, so only the given file has access to them.
  • I wanted to separate the bookkeeping logic from the reporting/function modeling logic in different files.
  • I like the fact that after the change the CStringChecker implements only the evalCall checker callback.

Let me know if I misunderstood something.

Mind that that entry does a lot more then give a flag to the user -- It generates code for a lot of the checker machinery as well. Since CStringModeling still uses the checker callbacks to set up the proper string length, it is a necessity. (strong) Checker dependencies are the exact solution to the the problem where a checker cannot run without another (as I understand it, its not only about making sense, you really cant make CStringChecker work without CStringModeling).

clang/include/clang/StaticAnalyzer/Checkers/Checkers.td
429 ↗(On Diff #279787)

What other aspects of c strings needs to be modeled? Is it only length? If so, how about we rename the checker to CStringLengthModeling?

495 ↗(On Diff #279787)

I dug around a bit, and found this commit as to why this was needed: rGe56167e8f87acf87a9de3d383752e18a738cf056. So this dependency is appropriate.

clang/lib/StaticAnalyzer/Checkers/CStringChecker.cpp
74

This is somewhat orthogonal to the patch, but shouldn't precondition violations be reported at preCall?

clang/lib/StaticAnalyzer/Checkers/CStringLength.cpp
175–181 ↗(On Diff #279787)

We traditionally put these on the bottom of the file -- I don't think this would upset the structure too much :)

steakhal marked 4 inline comments as done.Jul 23 2020, 10:49 AM

[...] you really cant make CStringChecker work without CStringModeling

How should I fuse the CStringModeling and the CStringChecker together while splitting them up?

I mean, that would be the best if the CStringChecker would focus on modeling the related cstring functions while letting the CStringModeling do the bookkeeping.
I see some contradiction here.

clang/include/clang/StaticAnalyzer/Checkers/Checkers.td
429 ↗(On Diff #279787)

For now I think the cstring length is enough.
I'm not sure if we will want to have other properties as well.
You are probably right.

495 ↗(On Diff #279787)

Interesting.
I was just doing a search & replace though :)

clang/lib/StaticAnalyzer/Checkers/CStringChecker.cpp
74

That is the current behavior.
We should consider in the future using preCall if we refactor so relentlessly.

clang/lib/StaticAnalyzer/Checkers/CStringLength.cpp
175–181 ↗(On Diff #279787)

I wanted to place the class definition as close as possible to the registration function.
I can move this though.

You could do it in the code, but if the modeling wouldn't be present from CStringModeling, CStringChecker wouldn't work properly. So you should make it a strong dependency, just as you did in this patch. My comment was mainly a response to @NoQ :)

clang/lib/StaticAnalyzer/Checkers/CStringLength.cpp
175–181 ↗(On Diff #279787)

Yeah, I see what you were going for, but I'd prefer to keep it down there still.

NoQ added a comment.Jul 23 2020, 9:02 PM

I wanted to have a separate class for bookkeeping while minimalizing the necessary changes.
What do you think would be the best way to organize this separation?

Few notes:

  • Checkers are implemented in the anonymous namespace, so only the given file has access to them.
  • I wanted to separate the bookkeeping logic from the reporting/function modeling logic in different files.
  • I like the fact that after the change the CStringChecker implements only the evalCall checker callback.

Let me know if I misunderstood something.

Mmm, none of these benefits sound like they outweigh confusing the cost of users with a new checker flag that can't even be used in any sensible way.

If you want separate files, just put the checker into a header and include it from multiple cpp files. A few checkers already do that - RetainCountChecker, MPIChecker, UninitializedObjectChecker. There's nothing fundamental about keeping checkers in an anonymous namespace.

NoQ added a comment.Jul 23 2020, 9:07 PM
In D84316#2168462, @NoQ wrote:

Such separation also looks amazing because ultimately the first part can be made checker-inspecific (i.e., a reusable half-baked trait that can be instantiated multiple times to track various things regardless of their meaning).

Could you elaborate on the latter part? (instantiated multiple times...)

Imagine something like re-using the state trait implementation between MallocChecker and StreamChecker because they both model "resources that can be deallocated twice or leaked" - regardless of the specific nature of these resources. These checkers can implement their own API modeling maps, escape rules, warning messages, maybe model additional aspects of their problems, but fundamentally they're solving the same problem: finding leaks and overreleases of resources. This problem should ideally be solved once. This is why i advocate for abstract, generalized, "half-baked" state trait boilerplate implementations that can be re-used across checkers.

steakhal updated this revision to Diff 280398.Jul 24 2020, 4:25 AM
  • Created the CStringChecker subdirectory holding the related files.
  • CStringChecker split up to 4 files:
    • CStringChecker.h: Declaration of the checker class.
    • CStringChecker.cpp: Definitions ONLY of the cstirng modeling functions.
    • CStringLength.h: The interface declaration of the cstring query/modification API (aka. API header).
    • CStringLengthModeling.cpp: Definitions of the bookkeeping part of the CStringChecker class AND the definitions of the API header.
  • Added the CStringChecker folder as an include directory for the analyzer library, to make it easier to include the API header.
Szelethus added a comment.EditedJul 24 2020, 4:37 AM

I personally preferred the previous diff, the reporting part of the checker and the string length modeling was separated far more cleanly.

In D84316#2171267, @NoQ wrote:

Mmm, none of these benefits sound like they outweigh confusing the cost of users with a new checker flag that can't even be used in any sensible way.

I think the new checker would be an ideal candidate for a hidden checker, not to mention that D78126+D81761 enforces it anyways with asserts. With that in mind, I don't see what we're giving up here.

If you want separate files, just put the checker into a header and include it from multiple cpp files. A few checkers already do that - RetainCountChecker, MPIChecker, UninitializedObjectChecker. There's nothing fundamental about keeping checkers in an anonymous namespace.

I agree that there is no magical gain from keeping them there. UninitializedObjectChecker, btw, doesn't peek out -- only some of the related infrastructure. With that said, I don't want to make an example out of RetainCountChecker -- the way it is structured is not in line, at least the way I see it, with the modern way to develop large checkers.

edit: Accidentally put a portion of my comment in the quote.

edit2: I realize I criticized RetainCountChecker's structure without actually pointing out the problems in it, I'll follow this up.

Szelethus added a comment.EditedJul 24 2020, 4:41 AM
In D84316#2171270, @NoQ wrote:

Imagine something like re-using the state trait implementation between MallocChecker and StreamChecker because they both model "resources that can be deallocated twice or leaked" - regardless of the specific nature of these resources. These checkers can implement their own API modeling maps, escape rules, warning messages, maybe model additional aspects of their problems, but fundamentally they're solving the same problem: finding leaks and overreleases of resources. This problem should ideally be solved once. This is why i advocate for abstract, generalized, "half-baked" state trait boilerplate implementations that can be re-used across checkers.

Big +1! I think there is a lot of smart in MallocChecker, and its far less confusing now then it used to be. It'd be worth exploring the merger of those checkers, but we should probably reserve this discussion for another time.

balazske added inline comments.
clang/lib/StaticAnalyzer/Checkers/CStringChecker/CStringLength.h
43

I do not like that the get and set (CStringLength) functions are not symmetrical. I (and other developers) would think that the get function returns a stored value and the set function sets it. The getCStringLength is more a computeCStringLength and additionally may manipulate the State too. In this form it is usable mostly only for CStringChecker. (A separate function to get the value stored in the length map should exist instead of this Hypothetical thing.)

Thanks for checking this @balazske.

clang/lib/StaticAnalyzer/Checkers/CStringChecker/CStringLength.h
43

[...] get function returns a stored value and the set function sets it.

Certainly a burden to understand. It would be more appealing, but more useful?
The user would have to check and create if necessary regardless. So fusing these two functions is more like a feature.
What use case do you think of using only the query function? In other words, how can you guarantee that you will find a length for a symbol?

In this form it is usable mostly only for CStringChecker. (A separate function to get the value stored in the length map should exist instead of this Hypothetical thing.)

You are right. However, I want to focus on splitting parts without modifying the already existing API reducing the risk of breaking things.
You should expect such a change in an upcoming patch.

steakhal added inline comments.Jul 28 2020, 11:31 AM
clang/lib/StaticAnalyzer/Checkers/CStringChecker/CStringLength.h
43

On second thought, It probably worth having a cleaner API to a slight inconvenience. If he feels like, still can wrap them.
I will investigate it tomorrow.

steakhal added inline comments.
clang/lib/StaticAnalyzer/Checkers/CStringChecker/CStringLength.h
43

I made a separate patch for cleansing this API.
In the D84979 now these API functions will behave as expected.

The title is a little bit confusing because only the C-string size model is going to be separated and be accessible. Other than that as @NoQ pointed out we need lot more of these common-API-separation patches. It is a great starting point for the CStringChecker.

clang/lib/StaticAnalyzer/Checkers/CMakeLists.txt
146

Other common checker functionality folders and headers do not require extra CMake support long ago. I think when we need such support, we could define it later, so that you could revert this.

clang/lib/StaticAnalyzer/Checkers/CStringChecker/CStringLength.h
43

I (and other developers) would think that the get function returns a stored value and the set function sets it.

Developers should not believe the getters are pure getters. As a checker-writer point of view, you do not care whether the C-string already exist or the checker creates it during symbolic execution, you only want to get the C-string.

Think about all the Static Analyzer getters as factory functions, that is the de facto standard now. For example, when you are trying to get a symbolic value with getSVal(), for the first occurrence of an expression no SVal exist, so it also creates it.

With that in mind, @steakhal, could you partially revert the renaming related refactors of D84979, please?

In D84316#2187372, @19n07u5 wrote:

The title is a little bit confusing because only the C-string size model is going to be separated and be accessible.

Could you elaborate on why is the title not precise?
It seems that the modeling logic and the reporting logic will be separated:

  • modeling will be implemented in CStringLengthModeling.cpp
  • reporting will be implemented in CStringChecker.cpp (just as like it was before)

I just wanted a short (at most 80 char long) title, if you offer any better I would be pleased.


Other than that as @NoQ pointed out we need lot more of these common-API-separation patches. It is a great starting point for the CStringChecker.

Thanks. I'm thinking about making the checker cleaner - we will see.

clang/lib/StaticAnalyzer/Checkers/CMakeLists.txt
146

It would be easier to use the CStringLength.h header without specifying the complete path to it.
IMO #include "CStringChecker/CStringLength.h" is kindof verbose compared to simply using #include "CStringLength.h".
As of now, I'm sticking to this.

clang/lib/StaticAnalyzer/Checkers/CStringChecker/CStringLength.h
43

[...] As a checker-writer point of view, you do not care whether the C-string already exist or the checker creates it during symbolic execution, you only want to get the C-string.

I would have agreed with you - before I made the D84979 patch.
Now I believe if the interface can be implemented purely then it should be done so.

Think about all the Static Analyzer getters as factory functions, that is the de facto standard now.

We can always change them.

For example, when you are trying to get a symbolic value with getSVal(), for the first occurrence of an expression no SVal exist, so it also creates it.

I'm not really familiar with the internals of getSVal(), I'm gonna definitely have on that.
IMO getSVal() is a different beast compared to the functions declared in this header file.

With that in mind, @steakhal, could you partially revert the renaming related refactors of D84979, please?

I genuinly think that I'm on the right track.

If you don't mind, move further discussion about that to the corresponding revision.

Do I sense correctly that the only information CSrtingLengthModeling.cpp requires from the actual CStringChecker is a checker tag? Because if so, I think we should just separate them even more cleanly -- we could just make a CStringLengthModeling checker implement the checker callbacks in that cpp file and we wouldn't need to move CStringChecker to a header. Not that moving it there is fundamentally bad, but in this instance it seems like we're legalizing bloating checkers instead of separating them. Also, this patch seems to have a significant overlap with D84979 -- is this intentional?

Do I sense correctly that the only information CSrtingLengthModeling.cpp requires from the actual CStringChecker is a checker tag?

AFAIK yes.

[...] it seems like we're legalizing bloating checkers instead of separating them.

I agree. I would genuinely have a modeling checker, and a completely different checker using the modeled information, let's call it CStringcChecker.
Unfortunately, that approach was not really well received - probably needs further discussion.

Also, this patch seems to have a significant overlap with D84979 -- is this intentional?

It is. In fact, this patch prepares the code for it.
I wanted to separate large code motion changes (like this one) from API refactoring changes (like D84979 does).
So in some sense, these two patches are hand in hand. Have a look at them ;)

NoQ added inline comments.Sep 8 2020, 2:37 PM
clang/lib/StaticAnalyzer/Checkers/CStringChecker/CStringChecker.cpp
32–34

Why suddenly use arrow syntax here?

clang/lib/StaticAnalyzer/Checkers/CStringChecker/CStringChecker.h
227

No NeWlInE aT eNd Of FiLe

clang/lib/StaticAnalyzer/Checkers/CStringChecker/CStringLengthModeling.cpp
309

Yes it is. It gets invoked during exploded graph dumps and it's an invaluable debugging facility.

I would like to discuss why don't we have a distinct checker managing the bookkeeping stuff of the CString lengths.
I just want a clean understanding and wide consensus about this.

Personally, I would still prefer the original version of this patch (+nits of course).

clang/lib/StaticAnalyzer/Checkers/CStringChecker/CStringChecker.cpp
32–34

This way I could spare the full name of the class :D
I will use the qualified name instead.

clang/lib/StaticAnalyzer/Checkers/CStringChecker/CStringChecker.h
227

I'm wondering why did clang-format not add this - I'm really surprised. Thanks.

clang/lib/StaticAnalyzer/Checkers/CStringChecker/CStringLengthModeling.cpp
309

A strange observation to note here.
In the implementation of the dump method, I use the provided NL and Sep parameters.
However, in the checker_messages of a State dump, Sep seem to be substituted with the empty string.
For example taint::printTaint just ignore the Sep parameter to possibly workaround this issue.