Page MenuHomePhabricator

[analyzer] Add yaml parser to GenericTaintChecker

Authored by boga95 on Mar 19 2019, 12:29 PM.



Parse the yaml configuration file and store the results in static variables. The user can define taint propagation rules, custom sinks, and filter functions.

Diff Detail


Event Timeline

boga95 created this revision.Mar 19 2019, 12:29 PM

I think the idea is awesome, thanks!

We really need to not use UINT_MAX, and I think static fields and member functions are a bit overused in this patch -- can't we just make getConfiguration and parseConfiguration static non-member functions, and make them take GenericTaintChecker as an argument?

73–75 ↗(On Diff #191369)

We should definitely change these, not only is the large integer number impossible to remember, but this value could differ on different platforms.

boga95 marked an inline comment as done.Mar 25 2019, 2:35 PM

Why is it better not to use static functions/variables? Has it any performance impact?

73–75 ↗(On Diff #191369)

I tried to use int, but I got a lot of warnings because of the getNumArgs() returns an unsigned value.

Why is it better not to use static functions/variables? Has it any performance impact?

I don't think so, I just think that it's easier to follow what's happening with a non-static field. Also, I just don't see why it needs to be static, you only use GenericTaintChecker::Custom* where you could easily access it. Especially GenericTaintChecker::getConfiguration -- if it wasn't static, you wouldn't even need to take the checker as a parameter.

Szelethus added inline comments.Mar 26 2019, 5:12 AM
73–75 ↗(On Diff #191369)

What warnings? I thought we have -Wsign-conversion disabled.

boga95 updated this revision to Diff 193175.Apr 1 2019, 1:59 PM
boga95 marked an inline comment as done.

We agreed to disagree on the static function stuff -- it doesn't really matter, and I don't insist. I have no objections against this patch, great job! I won't formally accept to make it stand out a little more. Thanks!

Sorry, I rushed my earlier comment -- I definitely think that we should get rid of the UINT_MAX thing before landing this.

NoQ added a comment.Apr 12 2019, 5:55 PM

This new approach is clearly useful to other checkers as well, not only the Taint checker. I believe we should strongly consider generalizing it somehow, it's just too awesome to restrict to a single checker.

There's also a closely related technology called "API Notes" that allows you to inject annotations into system headers by shipping .yaml files with the compiler. It was actively pushed for for the Analyzer in particular but ended up never getting upstreamed:

I suspect that it'll make perfect sense for your use case to define a clang attribute for taint sources and sinks and then define an API notes definition that'll inject this attribute into headers. If only API notes were available, this would have been a perfect use case for them. I think it might be a good idea to ping the mailing lists about API notes one more time, announce that you *are* going for a similar technology anyway, and see if anybody suggests something.


  • The attributes/API notes solution is very comfy because it both allows users to address their false positives / false negatives by adding annotations in their source *and* allows us to annotate system headers via yaml files, and (if i understand correctly) both sorts of annotations are accessible uniformly via Decl->getAttr<...>().
    • If we decide to go for our own yaml format that doesn't work on top of attributes, we'll either get only one of these, or double up our implementation burden in checkers.
      • Implementation burden should not be that high, but it will be annoying.
    • If we don't want users to annotate their headers with source-level annotations, it sounds easier to go for a custom yaml parser, because defining attributes is annoying.
      • But i believe we do want them to in this case.

Does any of this make any sense within the bigger plans that you have for this checker?

Sorry for the late answer, I was working on my thesis which is about taint analysis. During that, I implemented several cool features (namespaces, std::cin, std::string, etc.) for the checker. I will share them soon.

I thought about the API notes and I think it will fit very well into the checker. If my understanding is clear, the checker would be configured with attributes and/or a yaml file which contains the attributes. Therefore, the implementation will become simpler, because the checker will only read the attributes. I made a draft for the possible usage of the attributes:

int mySource(); // The return value will become tainted

[[taint::src(0, 1)]] [[taint::dst(2)]]
void myPropagator(int*, int*, int*);

[[taint::src(0)]] [[taint::varDst(2)]]
int myFscanf(FILE*, const char*, ...); // The variadic arguments will become tainted, if the first argument is tainted

[[taint::dst(0, -1)]] [[taint::varSrc(2)]]
int mySprintf(char*, const char*, ... ); // The first argument and the return value will become tainted, if any of the variadic arguments is tainted

I think we can use the current yaml configuration in order to not block my progress. I think most of the current implementation can be reused for the API notes. I will be able to easily change the interface after the API notes are ready.

73–75 ↗(On Diff #191369)

I got -Wsign-compare warnings, but it compiles. I will change it in the next review because that's contains the yaml file and the related tests.

boga95 updated this revision to Diff 199990.May 17 2019, 2:07 AM
boga95 edited the summary of this revision. (Show Details)

I changed the parsing, therefore the return value index is represented by -1.
I added a test configuration file and parse it when testing to prove the configuration doesn't break the code.

NoQ added a comment.May 23 2019, 1:26 PM

Ok! This looks like there isn't that much code, so it should be fine to duplicate. Do you have plans to eliminate the existing switch and instead use yaml for *all* supported functions while shipping a default .yaml file with the analyzer?

I'm still in doubts on how to connect your work with the CallDescription effort. I'll think more about that.

210–218 ↗(On Diff #199990)

I think you don't need to make them static. By the time you need them, you already have access to the instance of the checker. Static variables of non-trivial types are frowned upon (

294 ↗(On Diff #199990)

Let's re-use at least this function. I.e., abstract it out from GenericTaintChecker and put it into a header accessible by all checkers (dunno, lib/StaticAnalyzer/Checkers/Yaml.h).

316 ↗(On Diff #199990)

Do we ever need to copy the config? Maybe make it a move-only type?

boga95 marked 3 inline comments as done.May 23 2019, 1:44 PM

I already thought about it. It would make the code much cleaner, but it would have a little performance impact (Does it matter?).
It's straightforward to read the supported functions from another yaml file. Besides that, it can support multiple config files too.

73–75 ↗(On Diff #191369)

Now, this is just for internal representation. The -1 value is mapped to this.

Szelethus set the repository for this revision to rC Clang.Jun 4 2019, 5:55 PM

Generally looks good, some nits inline. Please mark comments you already solved as done.

52 ↗(On Diff #199990)

Is there a way to have only one type for argument vectors?

312 ↗(On Diff #199990)

We tend to not write the braces in small cases like this.

73–75 ↗(On Diff #191369)

What about the C++ way using numeric limits?

16 ↗(On Diff #199990)

Here Var stands for variadic I guess. I would prefer to avoid abbreviations as they might be misleading.

boga95 updated this revision to Diff 203339.Jun 6 2019, 6:08 AM
NoQ added a comment.Jun 6 2019, 9:26 PM
In D59555#1514602, @NoQ wrote:

I'm still in doubts on how to connect your work with the CallDescription effort. I'll think more about that.

I guess i'll just make a yaml reader for CallDescriptions as soon as the interface settles down a bit, and then propose you to switch to using it.

804–805 ↗(On Diff #203339)

I think i'll softly advocate for a more centralized format that doesn't require every checker to implement an option for just that purpose.

Will you be happy with a global analyzer flag, eg. -analyzer-config api-yaml=/home/foo/analyzer.yaml and then:


with possibly multiple checkers in the same file? I guess we can change it later if you don't mind breaking flag compatibility.

16–17 ↗(On Diff #203339)

I believe we should emit a compile error-like diagnostic here. One of the good things about compile errors would be that GUIs like scan-build would notify their users about compile errors in a friendly manner, while dumps to llvm::errs() will be completely ignored.

boga95 updated this revision to Diff 210073.Jul 16 2019, 5:45 AM
boga95 marked 11 inline comments as done.

Report diagnostic error in case of an invalid filename or an ill-formed yaml file.

Starting to look real good!

807 ↗(On Diff #210073)

We mark options that are not yet ready for used with InAlpha, rather then Released. I think it would be more fitting in this case!

Mind that if you'd like to list this option after that, you'd have to use

clang -cc1 -analyzer-checker-option-help-alpha
58 ↗(On Diff #210073)
302–303 ↗(On Diff #210073)

Do we emit an error for this case?

1 ↗(On Diff #210073)

Header blurb (licence stuff etc)!

1–2 ↗(On Diff #210073)

Could you please use line breaks here? I know it isn't your making, but if we're touching it anyways :^)

// RUN: %clang_analyze_cc1 -Wno-format-security -verify %s \
// RUN: \
// RUN:   -analyzer-checker=core \
// RUN: \
// RUN:   -analyzer-config \
// RUN:

And also a testcase for an incorrect checker option:

// RUN: not %clang_analyze_cc1 -verify %s \
// RUN: \
// RUN:   -analyzer-config \
// RUN: \
// RUN:   2>&1 | FileCheck %s -check-prefix=CHECK-INVALID-FILE

// CHECK-INVALID-FILE: (frontend): invalid input for checker option
// CHECK-INVALID-FILE-SAME:        that expects a valid yaml file

something like that.

boga95 updated this revision to Diff 210258.Jul 17 2019, 12:36 AM
boga95 marked an inline comment as done.
boga95 updated this revision to Diff 210260.Jul 17 2019, 12:38 AM
boga95 marked 2 inline comments as done.

Hmm, okay, so we convert -1 from the config file to UINT_MAX in the code, I like it!

I wrote a couple nits but they really are just that. In general, for each different error message, a different test case would be great.

805 ↗(On Diff #210260)

Okay, so how do I know what the format of that file is? How about we create another option (in a different revision) that dumps an example configuration with comments?

27 ↗(On Diff #210260)

Do we need this include?

89–92 ↗(On Diff #210260)

Leave this as is for now, but maaybe in the future we should just use an enum. What do you think?

304–306 ↗(On Diff #210260)

Could we have a test for this too? :)

1 ↗(On Diff #210260)

Have with with the same length as the rest :)

9 ↗(On Diff #210260)

Is this actually the reason why we have this file? We already have a YAML parser in LLVM, what's does this file do that the "default" parser doesn't?

14 ↗(On Diff #210260)

Hmmm, do we need this include?

16 ↗(On Diff #210260)
namespace clang {
namespace ento {
42–43 ↗(On Diff #210260)

And for this too?

48 ↗(On Diff #210260)
} // end of namespace clang
} // end of namespace ento
24 ↗(On Diff #210260)

Could you please add the rest of the error message?

NoQ accepted this revision.Jul 17 2019, 4:15 PM

Looks great to me once @Szelethus's comments are addressed. Thanks!!

34 ↗(On Diff #210260)

I suggest None for being more explicit and readable.

24 ↗(On Diff #210260)

I'd rather remove the rest of the error message. There's no need to duplicate something that the user has already written on the command line.

Or do we think like \escapes?

This revision is now accepted and ready to land.Jul 17 2019, 4:15 PM
Szelethus added inline comments.Jul 18 2019, 1:59 AM
48 ↗(On Diff #210260)

I mean, the other way around >.<

24 ↗(On Diff #210260)

I'm not sure what \escapes mean -- however, if we emit the filename in the error message, we should test it. Also, how many filenames has the user written in the comand line? Include paths for this and that, output file, inputfile, files to link against... I would very much prefer preserving the current error message.

boga95 updated this revision to Diff 210540.Jul 18 2019, 6:00 AM
boga95 marked 7 inline comments as done.
NoQ added inline comments.Jul 19 2019, 1:05 PM
24 ↗(On Diff #210260)

I'm thinking, like, the user is using zsh and writing file#1.txt instead of file\#1.txt.

Szelethus accepted this revision.Jul 21 2019, 4:37 AM

LGTM, thanks! Please mark inlines as done as you fix them. If you like the file description I proposed, I'm happy to commit this on your behalf (or you might as well apply for a commit access, since you have a track record of high quality patches already).

9 ↗(On Diff #210260)

I would appreciate some comments, it still isn't immediately obvious why we need yet another yaml file. How about

"This file defines convenience functions for handling YAML configuration files for checkers/packages".

boga95 updated this revision to Diff 212065.Jul 27 2019, 11:28 AM
boga95 marked 16 inline comments as done.
This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project. · View Herald TranscriptJul 28 2019, 6:37 AM
xiaobai added inline comments.

Is there a particular reason this is marked as delete? This present an issue because the Optional constructed in ento::registerGenericTaintChecker will use this constructor. I don't hit this issue with any recent version of clang (tried with 6, 7, and 8) but I am hitting it with clang 3.8 on the swift-ci ubuntu bots.