This is an archive of the discontinued LLVM Phabricator instance.

[analyzer] Model some library functions
ClosedPublic

Authored by NoQ on May 31 2016, 6:29 AM.

Details

Summary

I've put together a simple checker that throws no warnings, but models some library functions, which has already helped us to suppress some false positives in other checkers in our runs.

For pure functions, i chose the old evalCall() approach instead of the body-farm approach because i wanted to produce less state splits. For example, this checker produce a single exploded graph branch for ispunct()'s non-zero branch, when its argument is in range:

['!', '/'] U [':', '@'] U ['[', '\`'] U ['{', '~']

I'm not sure if there's a way to write this out with if's and produce less than 4 branches. (Do we have any plans on merging branches more aggressively during analysis?) Because these functions are pure, we'd hardly ever want to catch them with`evalCall()` again in another checker.

Additionally, this checker's brace-initializers for function specifications are quite short - of course they're limited to very simple cases - the list of these cases can be expanded though.

The checker doesn't seem to be noticeably degrading performance. Here's an example of a false positve squashed:

Here line is taken to be "", the line++ statement is executed at least once (by looking at the exploded graph; there's lack of "entering loop body" diagnostic piece because loop condition has complicated CFG, which is why it fails to highlight - a separate issue), and the analyzer fails to realize that isspace('\0') is false.

Diff Detail

Event Timeline

NoQ updated this revision to Diff 59050.May 31 2016, 6:29 AM
NoQ retitled this revision from to [analyzer] Model some library functions.
NoQ updated this object.
NoQ added reviewers: zaks.anna, dcoughlin.
NoQ added a subscriber: cfe-commits.
NoQ updated this object.May 31 2016, 6:35 AM
NoQ updated this object.
NoQ updated this object.May 31 2016, 6:37 AM

It is great to model more widely used functions! However I think the LibraryFunctionsChecker name might be a bit to broad, wouldn't something like StdCLibraryFunctions be more informative?

NoQ added a comment.Jun 2 2016, 1:08 AM

Yeah, good point, the "Std" part should definitely appear in the name, not sure about the "C" thing, as we could probably expand this checker to model some simple C++ functions as well (and then we'd make a separate checker section to move from unix. to cplusplus. or something like that, not sure maybe we'd need to reside in core. anyway).

zaks.anna edited edge metadata.Jun 21 2016, 2:05 PM

Thanks for the patch! Here are the comments, most of which are nits.

Could you add the high level description of what we are doing somewhere or maybe just describe the meaning of FunctionSpec since it encodes how functions are modeled.

Also, we should explain why we are not using BodyFarm somewhere in the comment.

lib/StaticAnalyzer/Checkers/LibraryFunctionsChecker.cpp
10

Please, list the functions.

27

Naming looks odd: maybe "OutOfRange" and "WithinRange"?

33

nit: "is Unsigned" -> "as Unsigned"
Please, use a typedef for the type as you are using it below in getArgType.

83

Are the types in FunctionSpec already canonical? If so, please, add a comment.

178

nit: If you could factor these out into separate helper functions, it would be easier to read. Lot's of nesting..

219

What happens when NewState == State? I guess addTransition would just not do anything, but maybe we should just make the intent explicit and not call it at all.

245

Replace "Normal" with a more descriptive name.

273

Looks like you might want to have the checking code as a member on FunctionSpec.

322

Let's explain what we are doing next. For example, "Let's initialize the FunctionSpec for the functions we are modeling."

Remove "NOTE:"

326

not clear where this is used or if it is used in the initialization at all.

337

Is every item in the range set used to bifurcate the state?

NoQ updated this revision to Diff 65248.Jul 23 2016, 12:14 PM
NoQ edited edge metadata.
NoQ marked 11 inline comments as done.

Renamed the checker as xazax.hun suggested, added a lot more comments, done with inline comments :)

zaks.anna added inline comments.Jul 23 2016, 2:55 PM
lib/StaticAnalyzer/Checkers/StdLibraryFunctionsChecker.cpp
191 ↗(On Diff #65248)

Do we need to talk about crashes when describing what this does?
Also, please, use oxygen throughout.

205 ↗(On Diff #65248)

We could either provide these APIs in CallEvent or at least have variants that return canonical types. Maybe we already do some of that?

445 ↗(On Diff #65248)

When can this go wrong? Are we checking if there is a mismatch between the function declaration and the call expression? It is strange that findFunctionSpec takes both of those. Couldn't you always get FunctionDecl out of CallExpr?

508 ↗(On Diff #65248)

you could also use /*NameOfTheField*/ convention to name the arguments if that would make this map more readable.

test/Analysis/std-library-functions.c
3 ↗(On Diff #65248)

Why are you not testing all of the functions?

NoQ updated this revision to Diff 65369.Jul 25 2016, 10:08 AM
NoQ marked 4 inline comments as done.
NoQ added inline comments.
lib/StaticAnalyzer/Checkers/StdLibraryFunctionsChecker.cpp
191 ↗(On Diff #65248)

Added more comments below.

205 ↗(On Diff #65248)

Maybe a separate commit? There are quite a few checkers from which the .getArgExpr(N)->getType() pattern could be de-duplicated, but i don't think many of them are interested in canonical types.

445 ↗(On Diff #65248)

Callee decl is path-sensitive information because functions can be passed around by pointers, as mentioned in the comment at the beginning of the function. Expanded the comment, added a test.

508 ↗(On Diff #65248)

I think compactness is worth it here, and specs are pretty easy to remember, imho. Added an example to the first spec to see how it looks and make it easier for the reader to adapt and remember, but i'm not quite convinced that verbosity is worth it here.

test/Analysis/std-library-functions.c
3 ↗(On Diff #65248)

I was too bored to generate tests for all branches of all functions (and if i auto-generate such tests, it defeats the purpose), but i added some of the more creative tests and covered at least some branches of all functions with them.

NoQ added inline comments.Jul 25 2016, 10:56 AM
lib/StaticAnalyzer/Checkers/StdLibraryFunctionsChecker.cpp
192 ↗(On Diff #65369)

Even though there are some doxygen-style comments in the checkers, i’ve never seen doxygen actually generate any docs for checker classes. Are they useful for IDE quick-hints only?

NoQ added a comment.Jul 27 2016, 6:42 AM

Answering myself: Hmm, so the only reason why MPI checker class appears in doxygen (http://clang.llvm.org/doxygen/classclang_1_1ento_1_1mpi_1_1MPIChecker.html) is because this class is not in anonymous namespace (as far as i understand, they needed to be multi-file for some reason). CheckerDocumentation says that every checker must be wrapped in anonymous namespace, except CheckerDocumentationChecker :)

I don’t really see a good reason for the library functions checker to be moved out of anonymous namespace or deserve a doxygen page - after all, it’s all in one file, and the docs are right in front of the reader’s eyes anyway. But maybe if this checker expands enough, we could expose its data structures into public use, and then they'd be worth documenting :)

dcoughlin added inline comments.Jul 27 2016, 3:47 PM
lib/StaticAnalyzer/Checkers/StdLibraryFunctionsChecker.cpp
509 ↗(On Diff #65369)

I disagree about compactness being valuable here. I think it is more important to intrinsically document the spec. These will be written once and read frequently. When they are written, they will copied from a previous example -- probably by someone who is not familiar with the code or the spec format.

Another possibility (not sure if it is the right one here) is to use macro tricks to define a simple DSL like Kulpreet did in the LocalizationChecker.cpp.

NoQ added inline comments.Jul 28 2016, 4:41 AM
lib/StaticAnalyzer/Checkers/StdLibraryFunctionsChecker.cpp
509 ↗(On Diff #65369)

These will be written once and read frequently.

If only it was so :))

Hmm. What do you think of the following format? Macros mostly expand to empty or (argument), but it should be more readable than the /*...*/ noise.

SPEC {
  FOR_FUNCTION("isalnum"),
  SPEC_DATA {
    ARGUMENT_TYPES { IntTy },
    RETURN_TYPE(IntTy),
    INVALIDATION_APPROACH(EvalCallAsPure),
    BRANCHES {
      BRANCH { // Boils down to isupper() or islower() or isdigit()
        RANGE {
          ARG_NO(0), RANGE_KIND(WithinRange),
          SET { SEG('0', '9') U SEG('A', 'Z') U SEG('a', 'z') }
        },
        RANGE {
          RET_VAL, RANGE_KIND(OutOfRange),
          SET { SEG(0, 0) }
        }
      },
      BRANCH { // The locale-specific branch.
        RANGE {
          ARG_NO(0), RANGE_KIND(WithinRange),
          SET { SEG(128, 255) }
        }
      },
      BRANCH { // Other.
        RANGE {
          ARG_NO(0), RANGE_KIND(OutOfRange),
          SET { SEG('0', '9') U SEG('A', 'Z')
                              U SEG('a', 'z') U SEG(128, 255)}
        },
        RANGE {
          RET_VAL, RANGE_KIND(WithinRange),
          SET { SEG(0, 0) }
        }
      }
    }
  }
},

Even though there are some doxygen-style comments in the checkers, i’ve never seen doxygen actually generate any docs for checker classes.
Are they useful for IDE quick-hints only?

I think it's useful to have consistent documentation format.

In D20811#497585, @NoQ wrote:

Answering myself: Hmm, so the only reason why MPI checker class appears in doxygen (http://clang.llvm.org/doxygen/classclang_1_1ento_1_1mpi_1_1MPIChecker.html) is because this class is not in anonymous namespace (as far as i understand, they needed to be multi-file for some reason). CheckerDocumentation says that every checker must be wrapped in anonymous namespace, except CheckerDocumentationChecker :)

I don’t really see a good reason for the library functions checker to be moved out of anonymous namespace or deserve a doxygen page - after all, it’s all in one file, and the docs are right in front of the reader’s eyes anyway. But maybe if this checker expands enough, we could expose its data structures into public use, and then they'd be worth documenting :)

It has been originally written as a large set of files. If you feel strongly about it, we could merge it into a single file. That makes sense to me. @Alexander_Droste, what do you think?

zaks.anna added inline comments.Jul 29 2016, 5:46 PM
lib/StaticAnalyzer/Checkers/StdLibraryFunctionsChecker.cpp
206 ↗(On Diff #65369)

Separate commit is fine. I'd provide both APIs in CallEvent.

test/Analysis/std-library-functions.c
4 ↗(On Diff #65369)

Ok.

It has been originally written as a large set of files. If you feel strongly about it, we could merge it into a single file. That makes sense to me. @Alexander_Droste, what do you think?

Hi,
I would still strongly prefer to keep them in separate files if possible. One of the headers (MPIFunctionClassifier.hpp)
also got moved to include/clang/StaticAnalyzer/Checkers, as it is needed by some MPI clang-tidy checks.
Is it really a problem if the checker comments are part of the Doxygen documentation? Further, I think that
the separation of concerns in form of distinct files might be valuable for people being new to the Clang Static Analyzer
framework, as the grouping of functionality is visible on a higher level of abstraction. Regardless, I would of course
accept if you prefer to merge the files into a single one, excluding the MPIFunctionClassifier.hpp header.

NoQ added a comment.Jul 30 2016, 12:46 PM

Is it really a problem if the checker comments are part of the Doxygen documentation?

Of course not :) I've been mostly thinking about the benefits of the anonymous namespace itself (cleaner global scope, no name collisions, but even these benefits are extremely minor).

NoQ updated this revision to Diff 71510.Sep 15 2016, 8:43 AM
NoQ marked an inline comment as done.

Added a huge amount of macros in order to improve readability of function specs.
Other inline comments should have been addressed before.

dcoughlin edited edge metadata.Sep 15 2016, 3:53 PM

Thanks for adding the macros. I've provided some feedback inline.

I think a good rule of thumb for readability is: suppose you are a maintainer and need to add a summary for a new function. Can you copy the the summary for an existing function and figure out what each component means so you can change it for the new function?

include/clang/StaticAnalyzer/Checkers/Checkers.td
383

I know you and Gábor already discussed this -- but shouldn't this be CStdLibraryFunctionsChecker or 'StdCLibraryFunctionsChecker'? Or is is your intent that both C and C++ standard libraries would be modeled by this checker?

lib/StaticAnalyzer/Checkers/StdLibraryFunctionsChecker.cpp
11 ↗(On Diff #71510)

"throw" --> "generate"

534 ↗(On Diff #71510)

Is "specification" the right term here? Or is this really a "summary"?

536 ↗(On Diff #71510)

'SPEC_DATA' doesn't seem to add much in terms of readability. Is it needed?

537 ↗(On Diff #71510)

The argument and return types seem like more a property of the function than than the summary. Why are they here and not with the function name?

540 ↗(On Diff #71510)

"Cases" seems more appropriate than "branches" (branching is an implementation detail).

542 ↗(On Diff #71510)

If I understand correctly, the first "argument" to branch describes the constraints on the function arguments and the second (if present) describes the resulting constraint on the return value when the argument constraint holds. Is there a way to make this apparent in the spelling of the summary?

As a straw proposal, what about renaming the first 'RANGE' to 'ARGUMENT_CONSTRAINT' and the second the 'RETURN_VALUE_CONSTRAINT'?

Or, more jargony, "PRECONDITION" and "POSTCONDITION"?

547 ↗(On Diff #71510)

Is it ever the case that this final 'RANGE" constrains anything other than the return value? If not, can 'RET_VAL' be elided?

554 ↗(On Diff #71510)

What is the motivation behind the use of geometric terms here (i.e., "SEG", "POINT")? Why not "INTERVAL" and "EXACT_VALUE"?

560 ↗(On Diff #71510)

Would you be opposed to 'UNION' instead of 'U'?

NoQ added a comment.Sep 16 2016, 12:49 AM

I think a good rule of thumb for readability is: suppose you are a maintainer and need to add a summary for a new function. Can you copy the the summary for an existing function and figure out what each component means so you can change it for the new function?

Seems i've written too many summaries to reliably use this rule :)

Could you have a look at another attempt?:

SUMMARY(isalnum, ARGUMENT_TYPES { IntTy }, RETURN_TYPE(IntTy),
        INVALIDATION_APPROACH(EvalCallAsPure))
  CASE // Boils down to isupper() or islower() or isdigit()
    PRE_CONDITION(ARG_NO(0), CONDITION_KIND(WithinRange))
      RANGE('0', '9')
      RANGE('A', 'Z')
      RANGE('a', 'z')
    END_PRE_CONDITION
    POST_CONDITION(OutOfRange)
      VALUE(0)
    END_POST_CONDITION
  END_CASE
  CASE // The locale-specific range.
    PRE_CONDITION(ARG_NO(0), CONDITION_KIND(WithinRange))
      RANGE(128, 255)
    END_PRE_CONDITION
    // No post-condition. We are completely unaware of
    // locale-specific return values.
  END_CASE
  CASE
    PRE_CONDITION(ARG_NO(0), CONDITION_KIND(OutOfRange))
      RANGE('0', '9')
      RANGE('A', 'Z')
      RANGE('a', 'z')
      RANGE(128, 255)
    END_PRE_CONDITION
    POST_CONDITION(WithinRange)
      VALUE(0)
    END_POST_CONDITION
  END_CASE
END_SUMMARY
include/clang/StaticAnalyzer/Checkers/Checkers.td
383

Hmm, i just realized what you guys were talking about :) The same checker cpp file and even the same checker object should probably produce different checker list entries here which would go into separate packages (cplusplus for C++ library functions, etc.). We could even split the specifications into different files, but the checker object would still be the same, defined in the same file. Will do.

lib/StaticAnalyzer/Checkers/StdLibraryFunctionsChecker.cpp
537 ↗(On Diff #71510)

Because this is where C++ initializer list syntax forces them to be. Hiding this detail is, as far as i see, only possible with the means of BEGIN_.../END_... macros (which isn't a big deal i think).

547 ↗(On Diff #71510)

Some summaries only have pre-conditions: "for this argument constraint, any return value is possible". We should also be able to support void functions, which have no return values.

I think this is much clearer! That said, now that I look at it with 'POSTCONDITION' alone I don't think it is clear that the provided value describes the return value. What do you think about renaming it 'RETURN_VALUE'? Or adding back the RET_VAL I asked you about removing before? :-)

Also: do you think CONDITION_KIND is needed? in PRECONDITION? Or can the bare kind be used like in POSTCONDITION?

NoQ added a comment.Sep 16 2016, 9:35 AM

That said, now that I look at it with 'POSTCONDITION' alone I don't think it is clear that the provided value describes the return value. What do you think about renaming it 'RETURN_VALUE'? Or adding back the RET_VAL I asked you about removing before? :-)

Hmm, what about

CONSTRAIN
  ARGUMENT_VALUE(0, WithinRange)
    RANGE('0', '9')
    RANGE('A', 'Z')
    RANGE('a', 'z')
  END_ARGUMENT_VALUE
  RETURN_VALUE(OutOfRange)
    VALUE(0)
  END_RETURN_VALUE
END_CONSTRAIN

?

Something i don't like here is that the word "value" is overloaded. Maybe rename the inner VALUE back to POINT?

Also: do you think CONDITION_KIND is needed? in PRECONDITION? Or can the bare kind be used like in POSTCONDITION?

I agree that it's ok to use the bare kind, because it's quite self-explanatory.

In D20811#544981, @NoQ wrote:

Hmm, what about

CONSTRAIN
  ARGUMENT_VALUE(0, WithinRange)
    RANGE('0', '9')
    RANGE('A', 'Z')
    RANGE('a', 'z')
  END_ARGUMENT_VALUE
  RETURN_VALUE(OutOfRange)
    VALUE(0)
  END_RETURN_VALUE
END_CONSTRAIN

?

"CONSTRAIN" is a verb. What is the direct object here? It seems to me that the thing being constrained is the return value, so it seems odd to have 'CONSTRAIN' around the conditions on the arguments.

Something i don't like here is that the word "value" is overloaded. Maybe rename the inner VALUE back to POINT?

I don't think the geometric metaphor of 'POINT' makes sense here, especially with 'RANGE' (which I think is very good). What is the analog of a range that has only a single element?

lib/StaticAnalyzer/Checkers/StdLibraryFunctionsChecker.cpp
547 ↗(On Diff #71510)

What does a postcondition on a void function mean in this context? Can you refer to argument values? Such as "If the the function terminates then it must have been the case that the first argument was in the rangy x..z even though we didn't know that going in? Is this useful?

Ping? Is there something blocking progress here? This functionality is very useful and almost done.

Thanks!

zaks.anna accepted this revision.Oct 12 2016, 9:55 PM
zaks.anna edited edge metadata.
This revision is now accepted and ready to land.Oct 12 2016, 9:55 PM
NoQ marked 9 inline comments as done.Oct 20 2016, 9:30 AM

I thought to give it a pause to take a fresh look at how to arrange the macro-hints in the summaries.

Maybe something like that:

CASE
  ARGUMENT_CONDITION(ARG_NO(0), OutOfRange)
    RANGE('0', '9')
    RANGE('A', 'Z')
    RANGE('a', 'z')
    RANGE(128, 255)
  END_ARGUMENT_CONDITION
  RETURN_VALUE_CONDITION(WithinRange)
    SINGLE_VALUE(0)
  END_RETURN_VALUE_CONDITION
END_CASE
lib/StaticAnalyzer/Checkers/StdLibraryFunctionsChecker.cpp
547 ↗(On Diff #71510)

No, i don't think this is useful. There are just timeless immutable symbols about which we learn something new on every branch.

If the function doesn't terminate on certain pre-conditons, then we can model it by never mentioning these pre-conditions in any of the branches (we don't use this trick anywhere yet - all functions listed here shall terminate in all cases).

This would have been useful if we start referring to the heap shape (eg. "if the value behind the pointer passed as second argument to the call was in range [1,10] before the call, then it would be equal to 42 after the call"), but we don't do that yet.

In D20811#575521, @NoQ wrote:

I thought to give it a pause to take a fresh look at how to arrange the macro-hints in the summaries.

Maybe something like that:

CASE
  ARGUMENT_CONDITION(ARG_NO(0), OutOfRange)
    RANGE('0', '9')
    RANGE('A', 'Z')
    RANGE('a', 'z')
    RANGE(128, 255)
  END_ARGUMENT_CONDITION
  RETURN_VALUE_CONDITION(WithinRange)
    SINGLE_VALUE(0)
  END_RETURN_VALUE_CONDITION
END_CASE

Looks great to me!

NoQ updated this revision to Diff 75446.Oct 21 2016, 10:12 AM
NoQ edited edge metadata.

Update the domain-specific language for function specs/summaries.

dcoughlin accepted this revision.Oct 21 2016, 10:39 AM
dcoughlin edited edge metadata.

This looks great!

lib/StaticAnalyzer/Checkers/StdLibraryFunctionsChecker.cpp
694 ↗(On Diff #75446)

I think think this comment should say 'lowercase'.

702 ↗(On Diff #75446)

Same here.

This revision was automatically updated to reflect the committed changes.