Page MenuHomePhabricator

[analyzer] Add yaml parser to GenericTaintChecker
Needs ReviewPublic

Authored by boga95 on Mar 19 2019, 12:29 PM.



Parse the yaml configuration file and store the results in static variables. The user can define taint propagation rules, custom sinks, and filter functions.

Diff Detail

rC Clang

Event Timeline

boga95 created this revision.Mar 19 2019, 12:29 PM

I think the idea is awesome, thanks!

We really need to not use UINT_MAX, and I think static fields and member functions are a bit overused in this patch -- can't we just make getConfiguration and parseConfiguration static non-member functions, and make them take GenericTaintChecker as an argument?


We should definitely change these, not only is the large integer number impossible to remember, but this value could differ on different platforms.

boga95 marked an inline comment as done.Mar 25 2019, 2:35 PM

Why is it better not to use static functions/variables? Has it any performance impact?


I tried to use int, but I got a lot of warnings because of the getNumArgs() returns an unsigned value.

Why is it better not to use static functions/variables? Has it any performance impact?

I don't think so, I just think that it's easier to follow what's happening with a non-static field. Also, I just don't see why it needs to be static, you only use GenericTaintChecker::Custom* where you could easily access it. Especially GenericTaintChecker::getConfiguration -- if it wasn't static, you wouldn't even need to take the checker as a parameter.

Szelethus added inline comments.Mar 26 2019, 5:12 AM

What warnings? I thought we have -Wsign-conversion disabled.

boga95 updated this revision to Diff 193175.Apr 1 2019, 1:59 PM
boga95 marked an inline comment as done.

We agreed to disagree on the static function stuff -- it doesn't really matter, and I don't insist. I have no objections against this patch, great job! I won't formally accept to make it stand out a little more. Thanks!

Sorry, I rushed my earlier comment -- I definitely think that we should get rid of the UINT_MAX thing before landing this.

NoQ added a comment.Apr 12 2019, 5:55 PM

This new approach is clearly useful to other checkers as well, not only the Taint checker. I believe we should strongly consider generalizing it somehow, it's just too awesome to restrict to a single checker.

There's also a closely related technology called "API Notes" that allows you to inject annotations into system headers by shipping .yaml files with the compiler. It was actively pushed for for the Analyzer in particular but ended up never getting upstreamed:

I suspect that it'll make perfect sense for your use case to define a clang attribute for taint sources and sinks and then define an API notes definition that'll inject this attribute into headers. If only API notes were available, this would have been a perfect use case for them. I think it might be a good idea to ping the mailing lists about API notes one more time, announce that you *are* going for a similar technology anyway, and see if anybody suggests something.


  • The attributes/API notes solution is very comfy because it both allows users to address their false positives / false negatives by adding annotations in their source *and* allows us to annotate system headers via yaml files, and (if i understand correctly) both sorts of annotations are accessible uniformly via Decl->getAttr<...>().
    • If we decide to go for our own yaml format that doesn't work on top of attributes, we'll either get only one of these, or double up our implementation burden in checkers.
      • Implementation burden should not be that high, but it will be annoying.
    • If we don't want users to annotate their headers with source-level annotations, it sounds easier to go for a custom yaml parser, because defining attributes is annoying.
      • But i believe we do want them to in this case.

Does any of this make any sense within the bigger plans that you have for this checker?

Sorry for the late answer, I was working on my thesis which is about taint analysis. During that, I implemented several cool features (namespaces, std::cin, std::string, etc.) for the checker. I will share them soon.

I thought about the API notes and I think it will fit very well into the checker. If my understanding is clear, the checker would be configured with attributes and/or a yaml file which contains the attributes. Therefore, the implementation will become simpler, because the checker will only read the attributes. I made a draft for the possible usage of the attributes:

int mySource(); // The return value will become tainted

[[taint::src(0, 1)]] [[taint::dst(2)]]
void myPropagator(int*, int*, int*);

[[taint::src(0)]] [[taint::varDst(2)]]
int myFscanf(FILE*, const char*, ...); // The variadic arguments will become tainted, if the first argument is tainted

[[taint::dst(0, -1)]] [[taint::varSrc(2)]]
int mySprintf(char*, const char*, ... ); // The first argument and the return value will become tainted, if any of the variadic arguments is tainted

I think we can use the current yaml configuration in order to not block my progress. I think most of the current implementation can be reused for the API notes. I will be able to easily change the interface after the API notes are ready.


I got -Wsign-compare warnings, but it compiles. I will change it in the next review because that's contains the yaml file and the related tests.

boga95 updated this revision to Diff 199990.Fri, May 17, 2:07 AM
boga95 edited the summary of this revision. (Show Details)

I changed the parsing, therefore the return value index is represented by -1.
I added a test configuration file and parse it when testing to prove the configuration doesn't break the code.

NoQ added a comment.Thu, May 23, 1:26 PM

Ok! This looks like there isn't that much code, so it should be fine to duplicate. Do you have plans to eliminate the existing switch and instead use yaml for *all* supported functions while shipping a default .yaml file with the analyzer?

I'm still in doubts on how to connect your work with the CallDescription effort. I'll think more about that.


I think you don't need to make them static. By the time you need them, you already have access to the instance of the checker. Static variables of non-trivial types are frowned upon (


Let's re-use at least this function. I.e., abstract it out from GenericTaintChecker and put it into a header accessible by all checkers (dunno, lib/StaticAnalyzer/Checkers/Yaml.h).


Do we ever need to copy the config? Maybe make it a move-only type?

boga95 marked 3 inline comments as done.Thu, May 23, 1:44 PM

I already thought about it. It would make the code much cleaner, but it would have a little performance impact (Does it matter?).
It's straightforward to read the supported functions from another yaml file. Besides that, it can support multiple config files too.


Now, this is just for internal representation. The -1 value is mapped to this.

Szelethus set the repository for this revision to rC Clang.Tue, Jun 4, 5:55 PM

Generally looks good, some nits inline. Please mark comments you already solved as done.


What about the C++ way using numeric limits?


Is there a way to have only one type for argument vectors?


We tend to not write the braces in small cases like this.


Here Var stands for variadic I guess. I would prefer to avoid abbreviations as they might be misleading.

boga95 updated this revision to Diff 203339.Thu, Jun 6, 6:08 AM
NoQ added a comment.Thu, Jun 6, 9:26 PM
In D59555#1514602, @NoQ wrote:

I'm still in doubts on how to connect your work with the CallDescription effort. I'll think more about that.

I guess i'll just make a yaml reader for CallDescriptions as soon as the interface settles down a bit, and then propose you to switch to using it.


I think i'll softly advocate for a more centralized format that doesn't require every checker to implement an option for just that purpose.

Will you be happy with a global analyzer flag, eg. -analyzer-config api-yaml=/home/foo/analyzer.yaml and then:


with possibly multiple checkers in the same file? I guess we can change it later if you don't mind breaking flag compatibility.


I believe we should emit a compile error-like diagnostic here. One of the good things about compile errors would be that GUIs like scan-build would notify their users about compile errors in a friendly manner, while dumps to llvm::errs() will be completely ignored.