Page MenuHomePhabricator

[analyzer] Add yaml parser to GenericTaintChecker
Needs ReviewPublic

Authored by boga95 on Mar 19 2019, 12:29 PM.



Parse the yaml configuration file and store it in static variables. The user can define taint propagation rules, custom sink, and filter functions. E.g:

# A list of source/propagation function
  # int x = mySource1(); // x is tainted
  - Name:     mySource1
    DstArgs:  [4294967294] # Index for return value

  # int x;
  # mySource2(&x); // x is tainted
  - Name:     mySource2
    DstArgs:  [0]

  # int x, y;
  # myScanf("%d %d", &x, &y); // x and y are tainted
  - Name:     myScanf
    VarType:  Dst
    VarIndex: 1

  # int x; // x is tainted
  # int y;
  # myPropagator(x, &y); // y is tainted
  - Name:     myPropagator
    SrcArgs:  [0]
    DstArgs:  [1]

  # const unsigned size = 100;
  # char buf[size];
  # int x, y;
  # int n = mySprintf(buf, size, "%d %d", x, y); // If size, x or y is tainted
  # // the return value and the buf will be tainted
  - Name:     mySnprintf
    SrcArgs:  [1]
    DstArgs:  [0, 4294967294]
    VarType:  Src
    VarIndex: 3

# A list of filter functions
  # int x; // x is tainted
  # myFilter(&x); // x is not tainted anymore
  - Name: myFilter
    Args: [0]

# A list of sink functions
  # int x, y; // x and y are tainted
  # mySink(x, 0, 1); // It will warn
  # mySink(0, 1, y); // It will warn
  # mySink(0, x, 1); // It won't warn
  - Name: mySink
    Args: [0, 2]

Diff Detail

Event Timeline

boga95 created this revision.Mar 19 2019, 12:29 PM

I think the idea is awesome, thanks!

We really need to not use UINT_MAX, and I think static fields and member functions are a bit overused in this patch -- can't we just make getConfiguration and parseConfiguration static non-member functions, and make them take GenericTaintChecker as an argument?


We should definitely change these, not only is the large integer number impossible to remember, but this value could differ on different platforms.

boga95 marked an inline comment as done.Mar 25 2019, 2:35 PM

Why is it better not to use static functions/variables? Has it any performance impact?


I tried to use int, but I got a lot of warnings because of the getNumArgs() returns an unsigned value.

Why is it better not to use static functions/variables? Has it any performance impact?

I don't think so, I just think that it's easier to follow what's happening with a non-static field. Also, I just don't see why it needs to be static, you only use GenericTaintChecker::Custom* where you could easily access it. Especially GenericTaintChecker::getConfiguration -- if it wasn't static, you wouldn't even need to take the checker as a parameter.

Szelethus added inline comments.Mar 26 2019, 5:12 AM

What warnings? I thought we have -Wsign-conversion disabled.

boga95 updated this revision to Diff 193175.Mon, Apr 1, 1:59 PM
boga95 marked an inline comment as done.

We agreed to disagree on the static function stuff -- it doesn't really matter, and I don't insist. I have no objections against this patch, great job! I won't formally accept to make it stand out a little more. Thanks!

Sorry, I rushed my earlier comment -- I definitely think that we should get rid of the UINT_MAX thing before landing this.

NoQ added a comment.Fri, Apr 12, 5:55 PM

This new approach is clearly useful to other checkers as well, not only the Taint checker. I believe we should strongly consider generalizing it somehow, it's just too awesome to restrict to a single checker.

There's also a closely related technology called "API Notes" that allows you to inject annotations into system headers by shipping .yaml files with the compiler. It was actively pushed for for the Analyzer in particular but ended up never getting upstreamed:

I suspect that it'll make perfect sense for your use case to define a clang attribute for taint sources and sinks and then define an API notes definition that'll inject this attribute into headers. If only API notes were available, this would have been a perfect use case for them. I think it might be a good idea to ping the mailing lists about API notes one more time, announce that you *are* going for a similar technology anyway, and see if anybody suggests something.


  • The attributes/API notes solution is very comfy because it both allows users to address their false positives / false negatives by adding annotations in their source *and* allows us to annotate system headers via yaml files, and (if i understand correctly) both sorts of annotations are accessible uniformly via Decl->getAttr<...>().
    • If we decide to go for our own yaml format that doesn't work on top of attributes, we'll either get only one of these, or double up our implementation burden in checkers.
      • Implementation burden should not be that high, but it will be annoying.
    • If we don't want users to annotate their headers with source-level annotations, it sounds easier to go for a custom yaml parser, because defining attributes is annoying.
      • But i believe we do want them to in this case.

Does any of this make any sense within the bigger plans that you have for this checker?