This is an archive of the discontinued LLVM Phabricator instance.

[analyzer] Improve `CallDescription` to handle c++ method.
ClosedPublic

Authored by MTC on Jun 11 2018, 7:39 AM.

Details

Summary

CallDecription can only handle function for the time being. If we want to match c++ method, we can only use method name to match and can't improve the matching accuracy through the qualifiers.

This patch add the support for QualifiedName matching to improve the matching accuracy.

Diff Detail

Repository
rC Clang

Event Timeline

MTC created this revision.Jun 11 2018, 7:39 AM
MTC added a comment.EditedJun 11 2018, 7:47 AM

The implementation is not complicated, the difficulty is that there is no good way to get the qualified name without template arguments. For std::basic_string::c_str(), its qualified name may be std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >::c_str, it is almost impossible for users to provide such a name. So one possible implementation is to use std, basic_string and c_str to match in the std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >::c_str sequentially.

MTC updated this revision to Diff 150758.Jun 11 2018, 7:56 AM

Remove useless header files for testing.

Having C++ support would be awesome!
Thanks for working on this!

While I do agree matching is not trivial with qualified names, this problem is already solved with AST matchers.

I wonder if using matchers or reusing the HasNameMatcher class would make this easier. That code path is already well tested and optimized.

MTC added a comment.Jun 11 2018, 7:13 PM

Having C++ support would be awesome!
Thanks for working on this!

While I do agree matching is not trivial with qualified names, this problem is already solved with AST matchers.

I wonder if using matchers or reusing the HasNameMatcher class would make this easier. That code path is already well tested and optimized.

Thank you for pointing out the direction, xazax.hun!

I will looking into the matchers and try to give a better solution.

MTC updated this revision to Diff 151166.Jun 13 2018, 7:48 AM
  • Use hasName matcher to match the qualified name.
  • Use the full name, like std::basic_string<char>::c_str instead of c_str to match the c_str method in DanglingInternalBufferChecker.cpp.
xazax.hun added inline comments.Jun 13 2018, 9:30 AM
lib/StaticAnalyzer/Checkers/DanglingInternalBufferChecker.cpp
31

I am not sure if this is the right solution in case of this check. We should track c_str calls regardless of what the template parameter is, so supporting any instantiation of basic_string is desired. This might not be the case, however, for other checks.

If we think it is more likely we do not care about the template arguments, maybe a separate API could be used, where we pass the qualified name of the class separately without the template arguments.
Alternatively, we could use matches name so the users could use regexps.

At this point I also wonder what isCalled API gives us compared to matchers? Maybe it is more convenient to use than calling a match. Also, isCalled API has an IdentifierInfo cached which could be used for relatively efficient checks.

@NoQ what do you think?

61–62

If we check for fully qualified names, this check might be redundant.

lib/StaticAnalyzer/Core/CallEvent.cpp
273

Doesn't match also traverse the subtree of the AST? We only need to check if that particular node matches the qualified name. I wonder if we could, during the traversal, find another node that is a match unintentionally.

NoQ added inline comments.Jun 13 2018, 3:18 PM
lib/StaticAnalyzer/Checkers/DanglingInternalBufferChecker.cpp
31

I agree that it's better to match the chain of classes and namespaces (in as much detail as the user cares to provide) and discard template parameters.

For example, i wish that a CallDescription that's defined as {"std", "basic_string", "c_str"} would be able to match both std::string::c_str() and std::__1::string::c_str().

MTC added inline comments.Jun 13 2018, 7:21 PM
lib/StaticAnalyzer/Checkers/DanglingInternalBufferChecker.cpp
31

Yea, checker writers may can't provide all the template arguments, and it's not convenient to use!

I will try to give a better solution!

61–62

Right, thanks!

lib/StaticAnalyzer/Core/CallEvent.cpp
273

Yea, this maybe a problem, I will test whether this is going to happen and give a fine-grained match. Thanks!

MTC updated this revision to Diff 152702.Jun 25 2018, 8:58 AM

Sorry for the long long delay, I was on the Dragon Boat Festival a few days ago.

This update has two parts:

  • Use the matchesName to match the AST node with the specified name, matchesName use regex to match the specified name.
  • Use std::vector to record the list of qualified name, user can provide information as much as possible to improve matching accuracy. But can also provide only the function name.

There are two remain issues:

  • There is possible match the wrong AST node, e.g. for the NamedDecl foo(), if the function body has the a::b::x, when we match a::b::X(), we will match foo() . Because I'm not familiar with ASTMatcher, my bad, I can't think of a perfect way to match only the specified nodes.
  • The std::vector to record the information provide by the users may be not the best data structure.
  • I'm not the regex expert, the regex used in this patch definitely has the chance to improve.

    And I am not good at English writing, if any comments are not very clear, please correct me. Thanks for the help!
george.karpenkov resigned from this revision.Jun 26 2018, 6:16 PM
MTC added a comment.Jun 29 2018, 7:29 PM

kindly ping!

NoQ added inline comments.Jul 2 2018, 4:27 PM
lib/StaticAnalyzer/Checkers/BlockInCriticalSectionChecker.cpp
68

I wish the old syntax for simple C functions was preserved.

This could be accidentally achieved by using ArrayRef<StringRef> instead of std::vector<StringRef> for your constructor's argument.

lib/StaticAnalyzer/Core/CallEvent.cpp
273–280

Uhm, i think we don't need to flatten the class name to a string and then regex to do this.

We can just unwrap the declaration and see if the newly unwrapped layer matches the expected name on every step, until all expected names have been found.

MTC added a comment.Jul 4 2018, 7:15 AM

Thanks for your review, NoQ!

lib/StaticAnalyzer/Checkers/BlockInCriticalSectionChecker.cpp
68

ArrayRef<StringRef> can't achieve this goal. The only way I can figure out is changing the declaration to the following form.

CallDescription(StringRef FuncName, unsigned RequiredArgs = NoArgRequirement, ArrayRef<StringRef> QualifiedName = None) {}

lib/StaticAnalyzer/Core/CallEvent.cpp
273–280

Is the way you mentioned above is similar NamedDecl::printQualifiedName()? Get the full name through the DeclContext chain. See https://github.com/llvm-mirror/clang/blob/master/lib/AST/Decl.cpp#L1511.

Or ignore namespace ,like std, just only consider the CXXRecordDecl? If so, dyn_cast<> might be enough.

In D48027#1142324, @MTC wrote:
  • There is possible match the wrong AST node, e.g. for the NamedDecl foo(), if the function body has the a::b::x, when we match a::b::X(), we will match foo() . Because I'm not familiar with ASTMatcher, my bad, I can't think of a perfect way to match only the specified nodes.

Hmm, instead of using the match function which will traverse the ast, we could use the HasNameMatcher class which can be invoked on only one node. That class is in an internal namespace though, so I think the best would be to ask the maintainers whether it is ok to use that class externally, or were whe should put the functionality we need.

MTC updated this revision to Diff 157499.Jul 26 2018, 9:36 AM

@xazax.hun Thanks for your tips! After some investigation, MatchFinder::match just traverse one ASTNode, that means match(namedDecl(HasNameMatcher())) and match(namedDecl(matchesName())) both not traverse children. So there are three ways to match the specified AST node.

  1. match(namedDecl(HasNameMatcher())) matches the full qualified name that contains template parameter and that's not what we want to see.
  2. match(namedDecl(matchesName())) uses llvm regex to match the full qualified name.
  3. Unwrap the declaration and match each layer, similar to HasNameMatcher::matchesNodeFullFast().

In this update, I use the third way to match the specified AST node. This is simpler than I thought.

In addition, add a redundant constructor that is only used to construct a CallDescription with one name, this avoids the modification of the existing checker, like BlockInCriticalSectionChecker.cpp. Any suggestions?

MTC added a comment.Aug 13 2018, 7:28 AM

kindly ping!

NoQ accepted this revision.Aug 15 2018, 12:49 PM

This looks great, thanks, this is exactly how i imagined it!

This revision is now accepted and ready to land.Aug 15 2018, 12:49 PM
xazax.hun added a comment.EditedAug 15 2018, 12:51 PM

Generally looks good, I only wonder if this works well with inline namespaces. Could you test? Probably it will.

MTC updated this revision to Diff 161217.Aug 17 2018, 5:24 AM
  • rebase
  • Since we have enhanced the ability of CallDescription, remove the helper method isCalledOnStringObject().
MTC added a comment.Aug 17 2018, 5:38 AM

Generally looks good, I only wonder if this works well with inline namespaces. Could you test? Probably it will.

Thank you for reminding me! Yea, this can handle inline namespaces.

However this approach has limit. Given the code below, we cannot distinguish whether the basic_string is user-defined struct or namespace. That's means when the user provide {"std", "basic_string", "append"}, we can only know the qualified name of the call sequentially contains std, basic_string, append. We don't know if these names come from RecordDecl or NamespaceDecl.

namespace  std {
  namespace basic_string {
    struct A {
      void append() {}
    };
  }
}

void foo() {
  std::basic_string::A a;
  a.append(); // Match
}

@rnkovacs What do you think? Can this approach meet InnerPointerChecker's needs?

MTC added a comment.Aug 21 2018, 4:23 AM

@xazax.hun What do you think?

xazax.hun accepted this revision.Aug 21 2018, 8:15 AM

Sorry for the delay, I think this is OK to commit.
As a possible improvement, I can imagine making it slightly stricter, e.g. you could only skip QualifiedNames for inline namespaces and the beginning. Maybe add support for staring with "" or :: to even disable skipping namespaces at the beginning?
But these are just nice to have features, I am perfectly ok with not having them or doing it in a followup patch.

Oh, and thanks for working on this, this improvement was long overdue, but everybody was busy with something else :)

rnkovacs accepted this revision.Aug 21 2018, 8:58 AM
In D48027#1203944, @MTC wrote:

However this approach has limit. Given the code below, we cannot distinguish whether the basic_string is user-defined struct or namespace. That's means when the user provide {"std", "basic_string", "append"}, we can only know the qualified name of the call sequentially contains std, basic_string, append. We don't know if these names come from RecordDecl or NamespaceDecl.

namespace  std {
  namespace basic_string {
    struct A {
      void append() {}
    };
  }
}

void foo() {
  std::basic_string::A a;
  a.append(); // Match
}

@rnkovacs What do you think? Can this approach meet InnerPointerChecker's needs?

I guess it is highly unlikely for someone to write namespaces like that, and if they do, I think they deserve to have them treated as a std::string.
Thanks, Henry, this is great!

MTC added a comment.Aug 22 2018, 6:28 AM

Sorry for the delay, I think this is OK to commit.
As a possible improvement, I can imagine making it slightly stricter, e.g. you could only skip QualifiedNames for inline namespaces and the beginning. Maybe add support for staring with "" or :: to even disable skipping namespaces at the beginning?
But these are just nice to have features, I am perfectly ok with not having them or doing it in a followup patch.

Thanks, Gábor!
I will land it first and do the improvement according to the mismatch case in the followup patch!

MTC edited the summary of this revision. (Show Details)Aug 22 2018, 6:30 AM
This revision was automatically updated to reflect the committed changes.
NoQ added a comment.EditedAug 22 2018, 1:16 PM

I guess it is highly unlikely for someone to write namespaces like that, and if they do, I think they deserve to have them treated as a std::string.

Yup. On the other hand, if we specify our method as {"basic_string", "c_str"} and they make a namespace basic_string { const char *c_str(); }, we are pretty likely to crash when we try to obtain this-value for the call that isn't a method. This is still not a big deal because it's still highly unlikely, but we've seen crash reports of this sort, and the easier our spec is, the more likely it becomes that somebody has a different thing with the same name. For example, if we want to model iterators and we specify {"begin"} without specifying the base class (so that all sorts of containers were covered), we still want to specify that we're looking for a method call and not for a global function call.

So i believe that one of the important remaining problems with CallDescription is to teach it to discriminate between global functions and methods. We can do it in a number of ways:

  1. Make a special sub-class for CallDescription depending on the sort of the function (too verbose),
  2. Tag all items in the list as "record" vs. "namespace" (too many tags),
  3. Dunno, tons of other boring and verbose approaches,
  4. Change our PreCall/PostCall callbacks to callback templates that let allow the user subscribe to specific sub-classes of CallEvent similarly to how he can subscribe to different kinds of statements in PreStmt<Class>/PostStmt<Class>, and then the user doesn't need to write any code to check it dynamically.
MTC added a comment.Aug 23 2018, 7:13 PM
In D48027#1209844, @NoQ wrote:

So i believe that one of the important remaining problems with CallDescription is to teach it to discriminate between global functions and methods. We can do it in a number of ways:

  1. Make a special sub-class for CallDescription depending on the sort of the function (too verbose),
  2. Tag all items in the list as "record" vs. "namespace" (too many tags),
  3. Dunno, tons of other boring and verbose approaches,
  4. Change our PreCall/PostCall callbacks to callback templates that let allow the user subscribe to specific sub-classes of CallEvent similarly to how he can subscribe to different kinds of statements in PreStmt<Class>/PostStmt<Class>, and then the user doesn't need to write any code to check it dynamically.

Personally, I prefer 4. It is more checker developer friendly!